[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16264 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70069/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16264 Merged build finished. Test PASSed.
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16264 **[Test build #70069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70069/consoleFull)** for PR 16264 at commit [`fb1933d`](https://github.com/apache/spark/commit/fb1933d22bfd8294b38c7f0e712c0f753ce56ba6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16264 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70070/ Test PASSed.
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16264 Merged build finished. Test PASSed.
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16264 **[Test build #70070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70070/consoleFull)** for PR 16264 at commit [`be9c846`](https://github.com/apache/spark/commit/be9c8466209a0d656ce9b7bb08ee94014391ea0d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16119: [SPARK-18687][Pyspark][SQL]Backward compatibility - crea...
Github user vijoshi commented on the issue: https://github.com/apache/spark/pull/16119 @holdenk reminder; also pinging @davies
[GitHub] spark pull request #16161: [SPARK-18717][SQL] Make code generation for Scala...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16161
[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16161 thanks, merging to master!
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16264 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70068/ Test PASSed.
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16264 Merged build finished. Test PASSed.
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16264 **[Test build #70068 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70068/consoleFull)** for PR 16264 at commit [`292745b`](https://github.com/apache/spark/commit/292745b5ea8fc7815db0bcc1be995cae0a33b0ec). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomF...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16264#discussion_r92112001 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -539,7 +539,7 @@ In the following example, we use the `longley` dataset to train a random forest ```{r} df <- createDataFrame(longley) -rfModel <- spark.randomForest(df, Employed ~ ., type = "regression", numTrees = 5) +rfModel <- spark.randomForest(df, Employed ~ ., type = "regression", maxDepth = 2, numTrees = 2) --- End diff -- you could also do this to limit output ```{r, include=FALSE} ops <- options() options(max.print=40) ``` there is another example in the vignettes
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16264 and if we are doing "(Added in Spark 2.1.0)" in https://github.com/apache/spark/pull/16222/files perhaps we should have the same for these 2? Actually, I'd vote for removing them: vignettes are for the specific version of the package you install. If you install SparkR 2.1.0, what is described there is in 2.1.0 by default.
[GitHub] spark issue #16263: [SPARK-18281][SQL][PySpark] Consumes the returned local ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16263 **[Test build #70071 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70071/consoleFull)** for PR 16263 at commit [`5ba5d11`](https://github.com/apache/spark/commit/5ba5d1163802a5250510491f97aa9592c785596f).
[GitHub] spark issue #16259: [Minor][SparkR]:fix kstest example error and add unit te...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16259 LGTM. Let's trigger AppVeyor again as Hyukjin suggests
[GitHub] spark pull request #16259: [Minor][SparkR]:fix kstest example error and add ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16259#discussion_r92111051 --- Diff: R/pkg/R/mllib.R --- @@ -1595,14 +1595,14 @@ setMethod("write.ml", signature(object = "ALSModel", path = "character"), #' \dontrun{ #' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25)) #' df <- createDataFrame(data) -#' test <- spark.ktest(df, "test", "norm", c(0, 1)) +#' test <- spark.kstest(df, "test", "norm", c(0, 1)) --- End diff -- yep, I saw this recently too but forgot to follow up.
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16264 LGTM. It might be good to have one example for classification, but that's optional.
[GitHub] spark issue #16247: [SPARK-18817][SparkR] set default spark-warehouse path t...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16247 Possibly; SPARK-16027 was just a hack, and I think the root issue remains.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15819 The current fix does not resolve the issue when users hit abnormal termination of the JVM. In addition, if the JVM does not stop, these temporary files could consume a lot of space. Thus, I think https://github.com/apache/spark/pull/16134 needs to be added too. This is just my opinion; we also need feedback from the other committers.
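The distinction above, between cleanup on normal shutdown and cleanup after abnormal termination, is easy to demonstrate outside of Spark: an exit hook only covers the normal case. A minimal Python sketch (the directory prefix is a hypothetical stand-in, not Spark's actual staging path):

```python
import atexit
import os
import shutil
import tempfile

# Hypothetical stand-in for a Hive staging directory.
staging_dir = tempfile.mkdtemp(prefix=".hive-staging_")

# A registered cleanup runs on *normal* interpreter exit only; a hard
# kill (SIGKILL, OOM killer) skips it, which is why abnormal JVM
# termination leaves staging directories behind.
atexit.register(shutil.rmtree, staging_dir, ignore_errors=True)

print(os.path.isdir(staging_dir))  # True while the process is alive
```

This is why a shutdown-hook-based fix alone is insufficient, and why a complementary cleanup of leftover directories (as in the linked PR) is worth considering.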
[GitHub] spark issue #16259: [Minor][SparkR]:fix kstest example error and add unit te...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16259 (@wangmiao1981 you could close and re-open this to re-trigger the build, as we already discussed before, if you or anyone cares).
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16264 **[Test build #70070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70070/consoleFull)** for PR 16264 at commit [`be9c846`](https://github.com/apache/spark/commit/be9c8466209a0d656ce9b7bb08ee94014391ea0d).
[GitHub] spark issue #16264: [SPARK-18793] [SPARK-18794] [R] add spark.randomForest/s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16264 **[Test build #70069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70069/consoleFull)** for PR 16264 at commit [`fb1933d`](https://github.com/apache/spark/commit/fb1933d22bfd8294b38c7f0e712c0f753ce56ba6).
[GitHub] spark issue #16262: [SPARK-17932][SQL][FOLLOWUP] Change statement `SHOW TABL...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16262 LGTM
[GitHub] spark issue #16264: [SPARK-18793] [R] add spark.randomForest to vignettes
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16264 **[Test build #70068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70068/consoleFull)** for PR 16264 at commit [`292745b`](https://github.com/apache/spark/commit/292745b5ea8fc7815db0bcc1be995cae0a33b0ec).
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r92108007 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -56,33 +58,93 @@ case class CreateArray(children: Seq[Expression]) extends Expression { } override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { -val arrayClass = classOf[GenericArrayData].getName -val values = ctx.freshName("values") -ctx.addMutableState("Object[]", values, s"this.$values = null;") +val array = ctx.freshName("array") -ev.copy(code = s""" - this.$values = new Object[${children.size}];""" + +val et = dataType.elementType +val evals = children.map(e => e.genCode(ctx)) +val isPrimitiveArray = ctx.isPrimitiveType(et) +val primitiveTypeName = if (isPrimitiveArray) ctx.primitiveTypeName(et) else "" +val (preprocess, arrayData, arrayWriter) = + GenArrayData.getCodeArrayData(ctx, et, children.size, isPrimitiveArray, array) + +ev.copy(code = + preprocess + ctx.splitExpressions( ctx.INPUT_ROW, -children.zipWithIndex.map { case (e, i) => - val eval = e.genCode(ctx) - eval.code + s""" -if (${eval.isNull}) { - $values[$i] = null; +evals.zipWithIndex.map { case (eval, i) => + eval.code + +(if (isPrimitiveArray) { + (if (!children(i).nullable) { +s"\n$arrayWriter.write($i, ${eval.value});" + } else { +s""" +if (${eval.isNull}) { --- End diff -- I think branch prediction can work very well for this case, we don't need to manually optimize the code.
[GitHub] spark pull request #16264: [SPARK-18792] [R] add spark.randomForest to vignet...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/16264 [SPARK-18792] [R] add spark.randomForest to vignettes ## What changes were proposed in this pull request? Mention `spark.randomForest` in vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc. cc: @jkbradley You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark SPARK-18793 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16264.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16264
[GitHub] spark issue #16247: [SPARK-18817][SparkR] set default spark-warehouse path t...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16247 This test failure seems related to this PR. It seems to be because the previous Hive-enabled Spark session is not closed properly between `test_sparkSQL.R` and `test_sparkR.R`; in particular, I suspect the lock on Derby taken via the Hive client. It seems related to SPARK-16027? There was only a single instance of a Hive-enabled Spark session there, but the test here introduces another one. I manually tested each after removing the other, and each seems to work fine.
[GitHub] spark issue #16247: [SPARK-18817][SparkR] set default spark-warehouse path t...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16247 re: test failure, it might be related to this change? The call stack is hidden; it should be trying to call into https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala and failing. Is it possible that tmpdir() is not writeable on Jenkins?
[GitHub] spark pull request #16222: [SPARK-18797][SparkR]:Update spark.logit in spark...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16222
[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15314 **[Test build #70067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70067/consoleFull)** for PR 15314 at commit [`20e77fb`](https://github.com/apache/spark/commit/20e77fbc29bcba0238edc1c1368f984f44f7776c).
[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/16222 LGTM. Merged into master and branch-2.1. I will change the `regParam` value in a follow-up PR.
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 @srowen @sethah One more commit adds a test case with `weight = 4.7`, which will round up to 5, to test the case @sethah described. All tests passed. I'm pretty sure R's rounding is the same as what I'm doing here. Please merge if there is no other issue. Thanks.
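On the rounding question above: R's `round()` and Python 3's built-in `round()` both use round-half-to-even ("banker's rounding"), so a weight of 4.7 rounds up to 5 (it is not a tie), while exact halves round toward the even neighbor. A quick illustration, with Python used here purely as a stand-in for the R behavior:

```python
# Round-half-to-even, matching R's round():
print(round(4.7))  # 5 (not a tie, rounds to the nearest integer)
print(round(4.5))  # 4 (tie, rounds to the even neighbor)
print(round(5.5))  # 6 (tie, rounds to the even neighbor)
```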
[GitHub] spark issue #16142: [SPARK-18716][CORE] Restrict the disk usage of spark eve...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16142 **[Test build #70066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70066/consoleFull)** for PR 16142 at commit [`96e9063`](https://github.com/apache/spark/commit/96e906326327c96f0c6d07bba14c46ac46e86ae0).
[GitHub] spark pull request #16220: [SPARK-18796][SS]StreamingQueryManager should not...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16220
[GitHub] spark issue #15915: [SPARK-18485][CORE] Underlying integer overflow when cre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15915 **[Test build #70065 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70065/consoleFull)** for PR 15915 at commit [`c48b8f9`](https://github.com/apache/spark/commit/c48b8f976e812d947f0cc54dfc159ac55dfb6211).
[GitHub] spark issue #16142: [SPARK-18716][CORE] Restrict the disk usage of spark eve...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16142 retest please
[GitHub] spark issue #15717: [SPARK-17910][SQL] Allow users to update the comment of ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15717 **[Test build #70064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70064/consoleFull)** for PR 15717 at commit [`a71683c`](https://github.com/apache/spark/commit/a71683c773a1117f58c7e2b5cb5a4b1b101326f4).
[GitHub] spark pull request #15915: [SPARK-18485][CORE] Underlying integer overflow w...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/15915#discussion_r92104532 --- Diff: core/src/test/scala/org/apache/spark/util/io/ChunkedByteBufferOutputStreamSuite.scala --- @@ -119,4 +119,21 @@ class ChunkedByteBufferOutputStreamSuite extends SparkFunSuite { assert(arrays(1).toSeq === ref.slice(10, 20)) assert(arrays(2).toSeq === ref.slice(20, 30)) } + + test("negative chunk size") { +val ref = new Array[Byte](8 * 1024 * 1024 + 10) +Random.nextBytes(ref) +val o = new ChunkedByteBufferOutputStream(-10, ByteBuffer.allocate) +o.write(ref) +o.close() +val arrays = o.toChunkedByteBuffer.getChunks().map(_.array()) +assert(arrays.length === 3) +assert(arrays(0).length === 4 * 1024 * 1024) +assert(arrays(1).length === 4 * 1024 * 1024) +assert(arrays(2).length === 10 ) + +assert(arrays(0).toSeq === ref.slice(0, 4 * 1024 * 1024)) +assert(arrays(1).toSeq === ref.slice(4 * 1024 * 1024, 8 * 1024 * 1024)) +assert(arrays(2).toSeq === ref.slice(8 * 1024 * 1024, 8 * 1024 * 1024 + 10)) + } } --- End diff -- Discovery starting. Discovery completed in 42 seconds, 124 milliseconds. Run starting. Expected test count is: 9 ChunkedByteBufferOutputStreamSuite: - empty output - write a single byte - write a single near boundary - write a single at boundary - single chunk output - single chunk output at boundary size - multiple chunk output - multiple chunk output at boundary size - negative chunk size Run completed in 42 seconds, 700 milliseconds. Total number of tests run: 9 Suites: completed 2, aborted 0 Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0 All tests passed.
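The test above asserts that a `ChunkedByteBufferOutputStream` built with chunk size `-10` still produces two 4 MB chunks plus a 10-byte tail. A minimal standalone sketch of that behavior — assuming, as the 4 MB assertions suggest, that a non-positive requested size falls back to a default chunk size (names here are illustrative, not Spark's actual implementation):

```scala
// Hypothetical sketch of invalid-chunk-size handling, inferred from the
// test's expectations; not the actual ChunkedByteBufferOutputStream code.
object ChunkSizeSketch {
  // Assumed default, matching the 4 * 1024 * 1024-byte chunks in the test.
  private val DefaultChunkSize = 4 * 1024 * 1024

  def effectiveChunkSize(requested: Int): Int =
    if (requested <= 0) DefaultChunkSize else requested

  // Split a payload of totalBytes into per-chunk lengths.
  def chunkLengths(totalBytes: Int, requested: Int): Seq[Int] = {
    val size = effectiveChunkSize(requested)
    val fullChunks = totalBytes / size
    val remainder = totalBytes % size
    Seq.fill(fullChunks)(size) ++ (if (remainder > 0) Seq(remainder) else Nil)
  }
}
```

With the test's 8 MB + 10 byte payload and a requested size of -10, this yields chunk lengths of 4 MB, 4 MB, and 10 bytes, mirroring the suite's assertions.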
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92081435 --- Diff: core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala --- @@ -157,8 +160,16 @@ abstract class SchedulerIntegrationSuite[T <: MockBackend: ClassTag] extends Spa } // When a job fails, we terminate before waiting for all the task end events to come in, // so there might still be a running task set. So we only check these conditions - // when the job succeeds - assert(taskScheduler.runningTaskSets.isEmpty) + // when the job succeeds. + // When the final task of a taskset completes, we post + // the event to the DAGScheduler event loop before we finish processing in the taskscheduler + // thread. Its possible the DAGScheduler thread processes the event, finishes the job, + // and notifies the job waiter before our original thread in the task scheduler finishes + // handling the event and marks the taskset as complete. So its ok if we need to wait a + // *little* bit longer for the original taskscheduler thread to finish up to deal w/ the race. + eventually(timeout(1 second), interval(100 millis)) { +assert(taskScheduler.runningTaskSets.isEmpty) --- End diff -- Is this related to this PR?
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92079858 --- Diff: core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala --- @@ -17,10 +17,322 @@ package org.apache.spark.scheduler -import org.apache.spark.{SparkConf, SparkFunSuite} +import org.mockito.Mockito.when +import org.scalatest.BeforeAndAfterEach +import org.scalatest.mock.MockitoSugar + +import org.apache.spark._ import org.apache.spark.internal.config +import org.apache.spark.util.ManualClock + +class BlacklistTrackerSuite extends SparkFunSuite with BeforeAndAfterEach with MockitoSugar +with LocalSparkContext { + + private val clock = new ManualClock(0) + + private var blacklist: BlacklistTracker = _ + private var scheduler: TaskSchedulerImpl = _ + private var conf: SparkConf = _ + + override def beforeEach(): Unit = { +conf = new SparkConf().setAppName("test").setMaster("local") + .set(config.BLACKLIST_ENABLED.key, "true") +scheduler = mockTaskSchedWithConf(conf) + +clock.setTime(0) +blacklist = new BlacklistTracker(conf, clock) + } + + override def afterEach(): Unit = { +if (blacklist != null) { + blacklist = null +} +if (scheduler != null) { + scheduler.stop() + scheduler = null +} +super.afterEach() + } + + val allExecutorAndHostIds = (('A' to 'Z').map("host" + _) ++ (1 to 100).map{_.toString}).toSet + + /** + * Its easier to write our tests as if we could directly look at the sets of nodes & executors in + * the blacklist. However the api doesn't expose a set, so this is a simple way to test + * something similar, since we know the universe of values that might appear in these sets. 
+ */ + def assertEquivalentToSet(f: String => Boolean, expected: Set[String]): Unit = { +allExecutorAndHostIds.foreach { id => + val actual = f(id) + val exp = expected.contains(id) + assert(actual === exp, raw"""for string "$id" """) +} + } + + def mockTaskSchedWithConf(conf: SparkConf): TaskSchedulerImpl = { +sc = new SparkContext(conf) +val scheduler = mock[TaskSchedulerImpl] +when(scheduler.sc).thenReturn(sc) + when(scheduler.mapOutputTracker).thenReturn(SparkEnv.get.mapOutputTracker) +scheduler + } + + def createTaskSetBlacklist(stageId: Int = 0): TaskSetBlacklist = { +new TaskSetBlacklist(conf, stageId, clock) + } + + test("executors can be blacklisted with only a few failures per stage") { +// For 4 different stages, executor 1 fails a task, then executor 2 succeeds the task, +// and then the task set is done. Not enough failures to blacklist the executor *within* +// any particular taskset, but we still blacklist the executor overall eventually. +// Also, we intentionally have a mix of task successes and failures -- there are even some +// successes after the executor is blacklisted. The idea here is those tasks get scheduled +// before the executor is blacklisted. We might get successes after blacklisting (because the +// executor might be flaky but not totally broken). But successes should not unblacklist the +// executor. +val failuresUntilBlacklisted = conf.get(config.MAX_FAILURES_PER_EXEC) +var failuresSoFar = 0 +(0 until failuresUntilBlacklisted * 10).foreach { stageId => + val taskSetBlacklist = createTaskSetBlacklist(stageId) + if (stageId % 2 == 0) { +// fail every other task --- End diff -- fail a task in every other stage?
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92076411 --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala --- @@ -17,10 +17,275 @@ package org.apache.spark.scheduler +import java.util.concurrent.atomic.AtomicReference + +import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet} + import org.apache.spark.SparkConf import org.apache.spark.internal.Logging import org.apache.spark.internal.config -import org.apache.spark.util.Utils +import org.apache.spark.util.{Clock, SystemClock, Utils} + +/** + * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting + * executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add + * additional blacklisting of executors and nodes for individual tasks and stages which works in + * concert with the blacklisting here. + * + * The tracker needs to deal with a variety of workloads, eg.: + * + * * bad user code -- this may lead to many task failures, but that should not count against + * individual executors + * * many small stages -- this may prevent a bad executor for having many failures within one + * stage, but still many failures over the entire application + * * "flaky" executors -- they don't fail every task, but are still faulty enough to merit + * blacklisting + * + * See the design doc on SPARK-8425 for a more in-depth discussion. + * + * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is + * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The + * one exception is [[nodeBlacklist()]], which can be called without holding a lock. 
+ */ +private[scheduler] class BlacklistTracker ( +conf: SparkConf, +clock: Clock = new SystemClock()) extends Logging { + + BlacklistTracker.validateBlacklistConfs(conf) + private val MAX_FAILURES_PER_EXEC = conf.get(config.MAX_FAILURES_PER_EXEC) + private val MAX_FAILED_EXEC_PER_NODE = conf.get(config.MAX_FAILED_EXEC_PER_NODE) + val BLACKLIST_TIMEOUT_MILLIS = BlacklistTracker.getBlacklistTimeout(conf) + + /** + * A map from executorId to information on task failures. Tracks the time of each task failure, + * so that we can avoid blacklisting executors due to failures that are very far apart. We do not + * actively remove from this as soon as tasks hit their timeouts, to avoid the time it would take + * to do so. But it will not grow too large, because as soon as an executor gets too many + * failures, we blacklist the executor and remove its entry here. + */ + private val executorIdToFailureList = new HashMap[String, ExecutorFailureList]() + val executorIdToBlacklistStatus = new HashMap[String, BlacklistedExecutor]() + val nodeIdToBlacklistExpiryTime = new HashMap[String, Long]() + /** + * An immutable copy of the set of nodes that are currently blacklisted. Kept in an + * AtomicReference to make [[nodeBlacklist()]] thread-safe. + */ + private val _nodeBlacklist = new AtomicReference[Set[String]](Set()) + /** + * Time when the next blacklist will expire. Used as a + * shortcut to avoid iterating over all entries in the blacklist when none will have expired. + */ + var nextExpiryTime: Long = Long.MaxValue + /** + * Mapping from nodes to all of the executors that have been blacklisted on that node. We do *not* + * remove from this when executors are removed from spark, so we can track when we get multiple + * successive blacklisted executors on one node. 
Nonetheless, it will not grow too large because + * there cannot be many blacklisted executors on one node, before we stop requesting more + * executors on that node, and we clean up the list of blacklisted executors once an executor has + * been blacklisted for BLACKLIST_TIMEOUT_MILLIS. + */ + val nodeToBlacklistedExecs = new HashMap[String, HashSet[String]]() + + /** + * Un-blacklists executors and nodes that have been blacklisted for at least + * BLACKLIST_TIMEOUT_MILLIS + */ + def applyBlacklistTimeout(): Unit = { +val now = clock.getTimeMillis() +// quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work +if (now > nextExpiryTime) { + // Apply the timeout to blacklisted nodes and executors + val execsToUnblacklist = executorIdToBlacklistStatus.filter(_._2.expiryTime < now).keys + if (execsToUnblacklist.nonEmpty) { +// Un-blacklist any executors that have been blacklisted longer
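The `nextExpiryTime` field documented above is a small but important optimization: `applyBlacklistTimeout()` only scans the blacklist maps when the earliest recorded expiry has actually passed. A simplified, self-contained sketch of that pattern (class and method names here are illustrative, not the actual `BlacklistTracker` code):

```scala
import scala.collection.mutable

// Sketch of the expiry-shortcut pattern: track the earliest expiry time so
// that most timeout checks can return immediately without scanning the map.
class ExpirySketch {
  private val expiryByExec = mutable.HashMap.empty[String, Long]
  private var nextExpiryTime: Long = Long.MaxValue

  def blacklist(execId: String, expiresAt: Long): Unit = {
    expiryByExec(execId) = expiresAt
    nextExpiryTime = math.min(nextExpiryTime, expiresAt)
  }

  def applyTimeout(now: Long): Unit = {
    // Fast path: if nothing can have expired yet, skip the scan entirely.
    if (now > nextExpiryTime) {
      val expired = expiryByExec.filter { case (_, t) => t < now }.keys.toSeq
      expiryByExec --= expired
      // Recompute the shortcut for the remaining entries.
      nextExpiryTime =
        if (expiryByExec.isEmpty) Long.MaxValue else expiryByExec.values.min
    }
  }

  def isBlacklisted(execId: String): Boolean = expiryByExec.contains(execId)
}
```

The fast path makes the common case (no expiries due) O(1), while the full scan runs only when at least one entry is guaranteed to be removable.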
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92079426 --- Diff: core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala --- @@ -17,10 +17,322 @@ package org.apache.spark.scheduler -import org.apache.spark.{SparkConf, SparkFunSuite} +import org.mockito.Mockito.when +import org.scalatest.BeforeAndAfterEach +import org.scalatest.mock.MockitoSugar + +import org.apache.spark._ import org.apache.spark.internal.config +import org.apache.spark.util.ManualClock + +class BlacklistTrackerSuite extends SparkFunSuite with BeforeAndAfterEach with MockitoSugar +with LocalSparkContext { + + private val clock = new ManualClock(0) + + private var blacklist: BlacklistTracker = _ + private var scheduler: TaskSchedulerImpl = _ + private var conf: SparkConf = _ + + override def beforeEach(): Unit = { +conf = new SparkConf().setAppName("test").setMaster("local") + .set(config.BLACKLIST_ENABLED.key, "true") +scheduler = mockTaskSchedWithConf(conf) + +clock.setTime(0) +blacklist = new BlacklistTracker(conf, clock) + } + + override def afterEach(): Unit = { +if (blacklist != null) { + blacklist = null +} +if (scheduler != null) { + scheduler.stop() + scheduler = null +} +super.afterEach() + } + + val allExecutorAndHostIds = (('A' to 'Z').map("host" + _) ++ (1 to 100).map{_.toString}).toSet --- End diff -- can you add a comment here saying all host / executor IDs used in this class's tests should be in this list so the below function works?
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r90543377 --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala --- @@ -17,10 +17,275 @@ package org.apache.spark.scheduler +import java.util.concurrent.atomic.AtomicReference + +import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet} + import org.apache.spark.SparkConf import org.apache.spark.internal.Logging import org.apache.spark.internal.config -import org.apache.spark.util.Utils +import org.apache.spark.util.{Clock, SystemClock, Utils} + +/** + * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting + * executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add + * additional blacklisting of executors and nodes for individual tasks and stages which works in + * concert with the blacklisting here. + * + * The tracker needs to deal with a variety of workloads, eg.: + * + * * bad user code -- this may lead to many task failures, but that should not count against + * individual executors + * * many small stages -- this may prevent a bad executor for having many failures within one + * stage, but still many failures over the entire application + * * "flaky" executors -- they don't fail every task, but are still faulty enough to merit + * blacklisting + * + * See the design doc on SPARK-8425 for a more in-depth discussion. + * + * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is + * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The + * one exception is [[nodeBlacklist()]], which can be called without holding a lock. 
+ */ +private[scheduler] class BlacklistTracker ( +conf: SparkConf, +clock: Clock = new SystemClock()) extends Logging { + + BlacklistTracker.validateBlacklistConfs(conf) + private val MAX_FAILURES_PER_EXEC = conf.get(config.MAX_FAILURES_PER_EXEC) + private val MAX_FAILED_EXEC_PER_NODE = conf.get(config.MAX_FAILED_EXEC_PER_NODE) + val BLACKLIST_TIMEOUT_MILLIS = BlacklistTracker.getBlacklistTimeout(conf) + + /** + * A map from executorId to information on task failures. Tracks the time of each task failure, + * so that we can avoid blacklisting executors due to failures that are very far apart. We do not + * actively remove from this as soon as tasks hit their timeouts, to avoid the time it would take + * to do so. But it will not grow too large, because as soon as an executor gets too many + * failures, we blacklist the executor and remove its entry here. + */ + private val executorIdToFailureList = new HashMap[String, ExecutorFailureList]() + val executorIdToBlacklistStatus = new HashMap[String, BlacklistedExecutor]() + val nodeIdToBlacklistExpiryTime = new HashMap[String, Long]() + /** + * An immutable copy of the set of nodes that are currently blacklisted. Kept in an + * AtomicReference to make [[nodeBlacklist()]] thread-safe. + */ + private val _nodeBlacklist = new AtomicReference[Set[String]](Set()) + /** + * Time when the next blacklist will expire. Used as a + * shortcut to avoid iterating over all entries in the blacklist when none will have expired. + */ + var nextExpiryTime: Long = Long.MaxValue + /** + * Mapping from nodes to all of the executors that have been blacklisted on that node. We do *not* + * remove from this when executors are removed from spark, so we can track when we get multiple + * successive blacklisted executors on one node. 
Nonetheless, it will not grow too large because + * there cannot be many blacklisted executors on one node, before we stop requesting more + * executors on that node, and we clean up the list of blacklisted executors once an executor has + * been blacklisted for BLACKLIST_TIMEOUT_MILLIS. + */ + val nodeToBlacklistedExecs = new HashMap[String, HashSet[String]]() + + /** + * Un-blacklists executors and nodes that have been blacklisted for at least + * BLACKLIST_TIMEOUT_MILLIS + */ + def applyBlacklistTimeout(): Unit = { +val now = clock.getTimeMillis() +// quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work +if (now > nextExpiryTime) { + // Apply the timeout to blacklisted nodes and executors + val execsToUnblacklist = executorIdToBlacklistStatus.filter(_._2.expiryTime < now).keys + if (execsToUnblacklist.nonEmpty) { +// Un-blacklist any executors that have been blacklisted longer
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92078434 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -584,6 +614,14 @@ private[spark] class TaskSchedulerImpl( executorIdToTaskCount.getOrElse(execId, -1) > 0 } + /** + * Get a snapshot of the currently blacklisted nodes for the entire application. This is + * thread-safe -- it can be called without a lock on the TaskScheduler. + */ + def nodeBlacklist(): scala.collection.immutable.Set[String] = { --- End diff -- Got it that makes sense / I agree.
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92079769 --- Diff: core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala --- @@ -17,10 +17,322 @@ package org.apache.spark.scheduler -import org.apache.spark.{SparkConf, SparkFunSuite} +import org.mockito.Mockito.when +import org.scalatest.BeforeAndAfterEach +import org.scalatest.mock.MockitoSugar + +import org.apache.spark._ import org.apache.spark.internal.config +import org.apache.spark.util.ManualClock + +class BlacklistTrackerSuite extends SparkFunSuite with BeforeAndAfterEach with MockitoSugar +with LocalSparkContext { + + private val clock = new ManualClock(0) + + private var blacklist: BlacklistTracker = _ + private var scheduler: TaskSchedulerImpl = _ + private var conf: SparkConf = _ + + override def beforeEach(): Unit = { +conf = new SparkConf().setAppName("test").setMaster("local") + .set(config.BLACKLIST_ENABLED.key, "true") +scheduler = mockTaskSchedWithConf(conf) + +clock.setTime(0) +blacklist = new BlacklistTracker(conf, clock) + } + + override def afterEach(): Unit = { +if (blacklist != null) { + blacklist = null +} +if (scheduler != null) { + scheduler.stop() + scheduler = null +} +super.afterEach() + } + + val allExecutorAndHostIds = (('A' to 'Z').map("host" + _) ++ (1 to 100).map{_.toString}).toSet + + /** + * Its easier to write our tests as if we could directly look at the sets of nodes & executors in + * the blacklist. However the api doesn't expose a set, so this is a simple way to test + * something similar, since we know the universe of values that might appear in these sets. 
+ */ + def assertEquivalentToSet(f: String => Boolean, expected: Set[String]): Unit = { +allExecutorAndHostIds.foreach { id => + val actual = f(id) + val exp = expected.contains(id) + assert(actual === exp, raw"""for string "$id" """) +} + } + + def mockTaskSchedWithConf(conf: SparkConf): TaskSchedulerImpl = { +sc = new SparkContext(conf) +val scheduler = mock[TaskSchedulerImpl] +when(scheduler.sc).thenReturn(sc) + when(scheduler.mapOutputTracker).thenReturn(SparkEnv.get.mapOutputTracker) +scheduler + } + + def createTaskSetBlacklist(stageId: Int = 0): TaskSetBlacklist = { +new TaskSetBlacklist(conf, stageId, clock) + } + + test("executors can be blacklisted with only a few failures per stage") { +// For 4 different stages, executor 1 fails a task, then executor 2 succeeds the task, --- End diff -- I don't think this is for 4 different stages? (looks like ~20?)
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r90543538 --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala --- @@ -17,10 +17,254 @@ package org.apache.spark.scheduler +import java.util.concurrent.atomic.AtomicReference + +import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet} + import org.apache.spark.SparkConf import org.apache.spark.internal.Logging import org.apache.spark.internal.config -import org.apache.spark.util.Utils +import org.apache.spark.util.{Clock, SystemClock, Utils} + +/** + * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting + * executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add + * additional blacklisting of executors and nodes for individual tasks and stages which works in + * concert with the blacklisting here. + * + * The tracker needs to deal with a variety of workloads, eg.: + * + * * bad user code -- this may lead to many task failures, but that should not count against + * individual executors + * * many small stages -- this may prevent a bad executor for having many failures within one + * stage, but still many failures over the entire application + * * "flaky" executors -- they don't fail every task, but are still faulty enough to merit + * blacklisting + * + * See the design doc on SPARK-8425 for a more in-depth discussion. + * + * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is + * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The + * one exception is [[nodeBlacklist()]], which can be called without holding a lock. 
+ */ +private[scheduler] class BlacklistTracker ( +conf: SparkConf, +clock: Clock = new SystemClock()) extends Logging { + + BlacklistTracker.validateBlacklistConfs(conf) + private val MAX_FAILURES_PER_EXEC = conf.get(config.MAX_FAILURES_PER_EXEC) + private val MAX_FAILED_EXEC_PER_NODE = conf.get(config.MAX_FAILED_EXEC_PER_NODE) + val BLACKLIST_TIMEOUT_MILLIS = BlacklistTracker.getBlacklistTimeout(conf) + + /** + * A map from executorId to information on task failures. Tracks the time of each task failure, + * so that we can avoid blacklisting executors due to failures that are very far apart. We do not + * actively remove from this as soon as tasks hit their timeouts, to avoid the time it would take + * to do so. But it will not grow too large, because as soon as an executor gets too many + * failures, we blacklist the executor and remove its entry here. + */ + private val executorIdToFailureList = new HashMap[String, ExecutorFailureList]() + val executorIdToBlacklistStatus = new HashMap[String, BlacklistedExecutor]() + val nodeIdToBlacklistExpiryTime = new HashMap[String, Long]() + /** + * An immutable copy of the set of nodes that are currently blacklisted. Kept in an + * AtomicReference to make [[nodeBlacklist()]] thread-safe. + */ + private val _nodeBlacklist = new AtomicReference[Set[String]](Set()) + /** + * Time when the next blacklist will expire. Used as a + * shortcut to avoid iterating over all entries in the blacklist when none will have expired. + */ + var nextExpiryTime: Long = Long.MaxValue + /** + * Mapping from nodes to all of the executors that have been blacklisted on that node. We do *not* + * remove from this when executors are removed from spark, so we can track when we get multiple + * successive blacklisted executors on one node. 
Nonetheless, it will not grow too large because + * there cannot be many blacklisted executors on one node, before we stop requesting more + * executors on that node, and we periodically clean up the list of blacklisted executors. + */ + val nodeToBlacklistedExecs = new HashMap[String, HashSet[String]]() + + /** + * Un-blacklists executors and nodes that have been blacklisted for at least + * BLACKLIST_TIMEOUT_MILLIS + */ + def applyBlacklistTimeout(): Unit = { +val now = clock.getTimeMillis() +// quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work +if (now > nextExpiryTime) { + // Apply the timeout to blacklisted nodes and executors + val execsToUnblacklist = executorIdToBlacklistStatus.filter(_._2.expiryTime < now).keys + if (execsToUnblacklist.nonEmpty) { +// Un-blacklist any executors that have been blacklisted longer than the blacklist timeout. +logInfo(s"Removing ex
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92103550 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -217,18 +219,35 @@ private[yarn] class YarnAllocator( * @param localityAwareTasks number of locality aware tasks to be used as container placement hint * @param hostToLocalTaskCount a map of preferred hostname to possible task counts to be used as * container placement hint. + * @param nodeBlacklist a set of blacklisted nodes, which is passed in to avoid allocating new +* containers on them. It will be used to update the application master's +* blacklist. * @return Whether the new requested total is different than the old value. */ def requestTotalExecutorsWithPreferredLocalities( requestedTotal: Int, localityAwareTasks: Int, - hostToLocalTaskCount: Map[String, Int]): Boolean = synchronized { + hostToLocalTaskCount: Map[String, Int], + nodeBlacklist: Set[String]): Boolean = synchronized { this.numLocalityAwareTasks = localityAwareTasks this.hostToLocalTaskCounts = hostToLocalTaskCount if (requestedTotal != targetNumExecutors) { logInfo(s"Driver requested a total number of $requestedTotal executor(s).") targetNumExecutors = requestedTotal + + // Update blacklist infomation to YARN ResouceManager for this application, + // in order to avoid allocating new Containers on the problematic nodes. + val blacklistAdditions = nodeBlacklist -- currentNodeBlacklist + val blacklistRemovals = currentNodeBlacklist -- nodeBlacklist + if (blacklistAdditions.nonEmpty) { +logInfo(s"adding nodes to blacklist: $blacklistAdditions") --- End diff -- can you add "YARN application master" here and below? (so "adding nodes to YARN application master's blacklist: $...") (to differentiate clearly from the app-level blacklisting used elsewhere)
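The diff quoted above reports only incremental changes to YARN: new blacklist entries are the set difference of the current blacklist against the previously reported one, and vice versa for removals. A standalone sketch of that delta computation (object and variable names are illustrative):

```scala
// Sketch of incremental blacklist reporting: compare the newly computed
// blacklist against the last state we reported, and emit only the deltas.
object BlacklistDelta {
  def deltas(current: Set[String],
             previous: Set[String]): (Set[String], Set[String]) = {
    val additions = current -- previous // nodes newly blacklisted
    val removals  = previous -- current // nodes whose blacklisting expired
    (additions, removals)
  }
}
```

Sending only the deltas keeps the update idempotent and avoids re-reporting nodes the resource manager already knows about.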
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92081806 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala --- @@ -408,6 +411,96 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B } assert(tsm.isZombie) } + +// the tasksSets complete, so the tracker should be notified of the successful ones +verify(blacklist, times(1)).updateBlacklistForSuccessfulTaskSet( + stageId = 0, + stageAttemptId = 0, + failuresByExec = stageToMockTaskSetBlacklist(0).execToFailures) +verify(blacklist, times(1)).updateBlacklistForSuccessfulTaskSet( + stageId = 1, + stageAttemptId = 0, + failuresByExec = stageToMockTaskSetBlacklist(1).execToFailures) +// but we shouldn't update for the failed taskset +verify(blacklist, never).updateBlacklistForSuccessfulTaskSet( + stageId = meq(2), + stageAttemptId = anyInt(), + failuresByExec = anyObject()) + } + + test("scheduled tasks obey node and executor blacklists") { --- End diff -- why not have this test be a normal case (i.e., when the tsm doesn't get aborted, but the blacklist is obeyed)? That seems like the more common case, and abortion is tested in the test below
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92103779

--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
@@ -121,13 +121,20 @@ private[spark] abstract class YarnSchedulerBackend(
     }
   }
+
+  private[cluster] def prepareRequestExecutors(requestedTotal: Int): RequestExecutors = {
+    val nodeBlacklist: Set[String] = scheduler.nodeBlacklist()
+    val filteredHostToLocalTaskCount =
+      hostToLocalTaskCount.filter { case (k, v) => !nodeBlacklist.contains(k) }

--- End diff --

Ok, cool. Can you add a comment here with the first thing that you said ("For locality preferences, ignore preferences for nodes that are blacklisted")?
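The filter under review simply drops locality preferences for hosts that are on the node blacklist. A rough Java sketch of the same idea (self-contained illustration only; the names mirror the Scala diff but this is not Spark's implementation):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class LocalityFilter {
    // Drop locality preferences for any host that is currently blacklisted,
    // so the resource request does not ask for containers on bad nodes.
    static Map<String, Integer> filterBlacklisted(
            Map<String, Integer> hostToLocalTaskCount, Set<String> nodeBlacklist) {
        return hostToLocalTaskCount.entrySet().stream()
                .filter(e -> !nodeBlacklist.contains(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("hostA", 3);
        counts.put("hostB", 5);
        Map<String, Integer> filtered = filterBlacklisted(counts, Set.of("hostA"));
        assert !filtered.containsKey("hostA"); // blacklisted host is ignored
        assert filtered.get("hostB") == 5;     // healthy host preference survives
    }
}
```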
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92103358

--- Diff: docs/configuration.md ---
@@ -1339,6 +1347,28 @@ Apart from these, the following properties are also available, and may be useful

+ spark.blacklist.application.maxFailedTasksPerExecutor
+ 2
+ (Experimental) How many different tasks must fail on one executor, in successful task sets,
+ before the executor is blacklisted for the entire application. Blacklisted executors will
+ be automatically added back to the pool of available resources after the timeout specified by
+ spark.blacklist.timeout. Note that with dynamic allocation, though, the executors
+ may get marked as idle and be reclaimed by the cluster manager.
+
+ spark.blacklist.application.maxFailedExecutorsPerNode
+ 2
+ (Experimental) How many different executors must be blacklisted for the entire application,
+ before the node is blacklisted for the entire application. Blacklisted nodes will
+ be automatically added back to the pool of available resources after the timeout specified by
+ spark.blacklist.timeout. Note that with dynamic allocation, though, the executors
+ may get marked as idle and be reclaimed by the cluster manager.

--- End diff --

the executors on the node may
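The two properties describe a two-level escalation: enough distinct task failures blacklist an executor, and enough blacklisted executors on one node (default 2) blacklist the whole node. A toy Java sketch of the node-level escalation (illustrative only; the constant and names are assumptions modeled on the docs, not Spark internals):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NodeEscalation {
    // Mirrors the documented default of spark.blacklist.application.maxFailedExecutorsPerNode.
    static final int MAX_FAILED_EXECUTORS_PER_NODE = 2;

    private final Map<String, Set<String>> nodeToBlacklistedExecs = new HashMap<>();
    private final Set<String> blacklistedNodes = new HashSet<>();

    // When enough distinct executors on one node have been blacklisted,
    // escalate and blacklist the node itself.
    void blacklistExecutor(String node, String execId) {
        Set<String> execs = nodeToBlacklistedExecs.computeIfAbsent(node, k -> new HashSet<>());
        execs.add(execId);
        if (execs.size() >= MAX_FAILED_EXECUTORS_PER_NODE) {
            blacklistedNodes.add(node);
        }
    }

    boolean isNodeBlacklisted(String node) {
        return blacklistedNodes.contains(node);
    }

    public static void main(String[] args) {
        NodeEscalation n = new NodeEscalation();
        n.blacklistExecutor("hostA", "1");
        assert !n.isNodeBlacklisted("hostA"); // one bad executor is not enough
        n.blacklistExecutor("hostA", "2");
        assert n.isNodeBlacklisted("hostA");  // second bad executor trips the node
    }
}
```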
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92077755

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorFailuresInTaskSet.scala ---
@@ -25,26 +25,30 @@ import scala.collection.mutable.HashMap
 private[scheduler] class ExecutorFailuresInTaskSet(val node: String) {
   /**
    * Mapping from index of the tasks in the taskset, to the number of times it has failed on this
-   * executor.
+   * executor and the failure time.

--- End diff --

Can you say "and the last failure time" or "and the most recent failure time"? Otherwise it's confusing, since there's a count of failures, and it's unclear which failure the time belongs to.
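The structure being documented is a per-executor map from task index to a (failure count, most recent failure time) pair, which is exactly why the reviewer wants the comment to say which failure the time refers to: every failure bumps the count, but only the latest timestamp is kept. A hypothetical Java sketch of that update (not Spark code):

```java
import java.util.HashMap;
import java.util.Map;

public class TaskFailures {
    // For each task index: entry[0] = failure count, entry[1] = most recent failure time (ms).
    private final Map<Integer, long[]> failures = new HashMap<>();

    void recordFailure(int taskIndex, long nowMillis) {
        long[] entry = failures.computeIfAbsent(taskIndex, k -> new long[2]);
        entry[0] += 1;        // bump the count on every failure
        entry[1] = nowMillis; // overwrite: only the most recent failure time is kept
    }

    long countFor(int taskIndex) {
        return failures.getOrDefault(taskIndex, new long[2])[0];
    }

    long lastFailureTime(int taskIndex) {
        return failures.getOrDefault(taskIndex, new long[2])[1];
    }

    public static void main(String[] args) {
        TaskFailures f = new TaskFailures();
        f.recordFailure(0, 100L);
        f.recordFailure(0, 250L);
        assert f.countFor(0) == 2;          // two failures counted
        assert f.lastFailureTime(0) == 250L; // but only the latest time survives
    }
}
```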
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92081374

--- Diff: core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala ---
@@ -157,8 +160,16 @@ abstract class SchedulerIntegrationSuite[T <: MockBackend: ClassTag] extends Spa
     }
     // When a job fails, we terminate before waiting for all the task end events to come in,
     // so there might still be a running task set. So we only check these conditions
-    // when the job succeeds
-    assert(taskScheduler.runningTaskSets.isEmpty)
+    // when the job succeeds.
+    // When the final task of a taskset completes, we post
+    // the event to the DAGScheduler event loop before we finish processing in the taskscheduler
+    // thread. Its possible the DAGScheduler thread processes the event, finishes the job,

--- End diff --

It's
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r90543965

--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -17,10 +17,275 @@
 package org.apache.spark.scheduler

+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet}
+
 import org.apache.spark.SparkConf
 import org.apache.spark.internal.Logging
 import org.apache.spark.internal.config
-import org.apache.spark.util.Utils
+import org.apache.spark.util.{Clock, SystemClock, Utils}
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
+ * executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add
+ * additional blacklisting of executors and nodes for individual tasks and stages which works in
+ * concert with the blacklisting here.
+ *
+ * The tracker needs to deal with a variety of workloads, eg.:
+ *
+ *  * bad user code -- this may lead to many task failures, but that should not count against
+ *    individual executors
+ *  * many small stages -- this may prevent a bad executor for having many failures within one
+ *    stage, but still many failures over the entire application
+ *  * "flaky" executors -- they don't fail every task, but are still faulty enough to merit
+ *    blacklisting
+ *
+ * See the design doc on SPARK-8425 for a more in-depth discussion.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
+ * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
+ * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+    conf: SparkConf,
+    clock: Clock = new SystemClock()) extends Logging {
+
+  BlacklistTracker.validateBlacklistConfs(conf)
+  private val MAX_FAILURES_PER_EXEC = conf.get(config.MAX_FAILURES_PER_EXEC)
+  private val MAX_FAILED_EXEC_PER_NODE = conf.get(config.MAX_FAILED_EXEC_PER_NODE)
+  val BLACKLIST_TIMEOUT_MILLIS = BlacklistTracker.getBlacklistTimeout(conf)
+
+  /**
+   * A map from executorId to information on task failures. Tracks the time of each task failure,
+   * so that we can avoid blacklisting executors due to failures that are very far apart. We do not
+   * actively remove from this as soon as tasks hit their timeouts, to avoid the time it would take
+   * to do so. But it will not grow too large, because as soon as an executor gets too many
+   * failures, we blacklist the executor and remove its entry here.
+   */
+  private val executorIdToFailureList = new HashMap[String, ExecutorFailureList]()
+  val executorIdToBlacklistStatus = new HashMap[String, BlacklistedExecutor]()
+  val nodeIdToBlacklistExpiryTime = new HashMap[String, Long]()
+  /**
+   * An immutable copy of the set of nodes that are currently blacklisted. Kept in an
+   * AtomicReference to make [[nodeBlacklist()]] thread-safe.
+   */
+  private val _nodeBlacklist = new AtomicReference[Set[String]](Set())
+  /**
+   * Time when the next blacklist will expire. Used as a
+   * shortcut to avoid iterating over all entries in the blacklist when none will have expired.
+   */
+  var nextExpiryTime: Long = Long.MaxValue
+  /**
+   * Mapping from nodes to all of the executors that have been blacklisted on that node. We do *not*
+   * remove from this when executors are removed from spark, so we can track when we get multiple
+   * successive blacklisted executors on one node. Nonetheless, it will not grow too large because
+   * there cannot be many blacklisted executors on one node, before we stop requesting more
+   * executors on that node, and we clean up the list of blacklisted executors once an executor has
+   * been blacklisted for BLACKLIST_TIMEOUT_MILLIS.
+   */
+  val nodeToBlacklistedExecs = new HashMap[String, HashSet[String]]()
+
+  /**
+   * Un-blacklists executors and nodes that have been blacklisted for at least
+   * BLACKLIST_TIMEOUT_MILLIS
+   */
+  def applyBlacklistTimeout(): Unit = {
+    val now = clock.getTimeMillis()
+    // quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+    if (now > nextExpiryTime) {
+      // Apply the timeout to blacklisted nodes and executors
+      val execsToUnblacklist = executorIdToBlacklistStatus.filter(_._2.expiryTime < now).keys
+      if (execsToUnblacklist.nonEmpty) {
+        // Un-blacklist any executors that have been blacklisted longer
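The `nextExpiryTime` field in the quoted code is a shortcut: when the earliest expiry is still in the future, `applyBlacklistTimeout` returns without scanning the whole blacklist. A self-contained Java sketch of that pattern (illustrative only; names are borrowed from the diff, and the real tracker also handles nodes and failure lists):

```java
import java.util.HashMap;
import java.util.Map;

public class ExpiryTracker {
    private final Map<String, Long> execToExpiryTime = new HashMap<>();
    // Earliest expiry across all blacklisted executors; Long.MAX_VALUE means "nothing to expire".
    private long nextExpiryTime = Long.MAX_VALUE;

    void blacklist(String execId, long now, long timeoutMillis) {
        long expiry = now + timeoutMillis;
        execToExpiryTime.put(execId, expiry);
        nextExpiryTime = Math.min(nextExpiryTime, expiry);
    }

    // Un-blacklist anything whose timeout has passed. The nextExpiryTime check lets
    // the common case (nothing expired yet) return without iterating the map.
    void applyTimeout(long now) {
        if (now <= nextExpiryTime) {
            return; // shortcut: the earliest expiry is still in the future
        }
        execToExpiryTime.values().removeIf(expiry -> expiry < now);
        nextExpiryTime = execToExpiryTime.values().stream()
                .mapToLong(Long::longValue).min().orElse(Long.MAX_VALUE);
    }

    boolean isBlacklisted(String execId) {
        return execToExpiryTime.containsKey(execId);
    }

    public static void main(String[] args) {
        ExpiryTracker t = new ExpiryTracker();
        t.blacklist("exec1", 0L, 1000L);
        t.applyTimeout(500L);           // nothing can have expired: fast path
        assert t.isBlacklisted("exec1");
        t.applyTimeout(1001L);          // past the expiry: executor is released
        assert !t.isBlacklisted("exec1");
    }
}
```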
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r90147862

--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -17,10 +17,272 @@
 package org.apache.spark.scheduler

+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet}
+
 import org.apache.spark.SparkConf
 import org.apache.spark.internal.Logging
 import org.apache.spark.internal.config
-import org.apache.spark.util.Utils
+import org.apache.spark.util.{Clock, SystemClock, Utils}
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
+ * executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add
+ * additional blacklisting of executors and nodes for individual tasks and stages which works in
+ * concert with the blacklisting here.
+ *
+ * The tracker needs to deal with a variety of workloads, eg.:
+ *
+ *  * bad user code -- this may lead to many task failures, but that should not count against
+ *    individual executors
+ *  * many small stages -- this may prevent a bad executor for having many failures within one
+ *    stage, but still many failures over the entire application
+ *  * "flaky" executors -- they don't fail every task, but are still faulty enough to merit
+ *    blacklisting
+ *
+ * See the design doc on SPARK-8425 for a more in-depth discussion.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
+ * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
+ * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+    conf: SparkConf,
+    clock: Clock = new SystemClock()) extends Logging {
+
+  BlacklistTracker.validateBlacklistConfs(conf)
+  private val MAX_FAILURES_PER_EXEC = conf.get(config.MAX_FAILURES_PER_EXEC)
+  private val MAX_FAILED_EXEC_PER_NODE = conf.get(config.MAX_FAILED_EXEC_PER_NODE)
+  val BLACKLIST_TIMEOUT_MILLIS = BlacklistTracker.getBlacklistTimeout(conf)
+
+  /**
+   * A map from executorId to information on task failures. Tracks the time of each task failure,
+   * so that we can avoid blacklisting executors due to failures that are very far apart. We do not
+   * actively remove from this as soon as tasks hit their timeouts, to avoid the time it would take
+   * to do so. But it will not grow too large, because as soon as an executor gets too many
+   * failures, we blacklist the executor and remove its entry here.
+   */
+  private val executorIdToFailureList = new  HashMap[String, ExecutorFailureList]()

--- End diff --

Super nit: remove the extra space between "new" and "HashMap".
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92103809

--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
@@ -121,13 +121,20 @@ private[spark] abstract class YarnSchedulerBackend(
     }
   }
+
+  private[cluster] def prepareRequestExecutors(requestedTotal: Int): RequestExecutors = {

--- End diff --

Ok makes sense
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92078684

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -678,4 +716,13 @@ private[spark] object TaskSchedulerImpl {
     retval.toList
   }
+
+  private def maybeCreateBlacklistTracker(conf: SparkConf): Option[BlacklistTracker] = {

--- End diff --

Does it make sense for this method to be in the BlacklistTracker object? (Then it could just be BlacklistTracker.maybeCreate().)
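The suggestion is to move the factory next to the class it creates, as a companion-object method returning the tracker only when blacklisting is enabled. A hypothetical Java sketch of that shape (the `spark.blacklist.enabled` key matches the `config.BLACKLIST_ENABLED` flag set in the test suite elsewhere in this thread, but the code below is an illustration, not Spark's):

```java
import java.util.Map;
import java.util.Optional;

public class MaybeCreate {
    static class Tracker {} // stand-in for BlacklistTracker

    // Factory colocated with the class it creates: callers just ask
    // MaybeCreate.maybeCreate(conf) and get an empty Optional when disabled.
    static Optional<Tracker> maybeCreate(Map<String, String> conf) {
        boolean enabled = Boolean.parseBoolean(
                conf.getOrDefault("spark.blacklist.enabled", "false"));
        return enabled ? Optional.of(new Tracker()) : Optional.empty();
    }

    public static void main(String[] args) {
        assert maybeCreate(Map.of()).isEmpty(); // disabled by default
        assert maybeCreate(Map.of("spark.blacklist.enabled", "true")).isPresent();
    }
}
```

Keeping the factory on the companion object means the enable/disable decision lives with the tracker itself rather than in every caller.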
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92103410

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -691,11 +691,11 @@ private[spark] class ApplicationMaster(
   }

   override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
-    case RequestExecutors(requestedTotal, localityAwareTasks, hostToLocalTaskCount) =>
+    case r: RequestExecutors =>

--- End diff --

Ok cool, seems fine
[GitHub] spark pull request #14079: [SPARK-8425][CORE] Application Level Blacklisting
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r92079984

--- Diff: core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala ---
@@ -17,10 +17,322 @@
 package org.apache.spark.scheduler

-import org.apache.spark.{SparkConf, SparkFunSuite}
+import org.mockito.Mockito.when
+import org.scalatest.BeforeAndAfterEach
+import org.scalatest.mock.MockitoSugar
+
+import org.apache.spark._
 import org.apache.spark.internal.config
+import org.apache.spark.util.ManualClock
+
+class BlacklistTrackerSuite extends SparkFunSuite with BeforeAndAfterEach with MockitoSugar
+    with LocalSparkContext {
+
+  private val clock = new ManualClock(0)
+
+  private var blacklist: BlacklistTracker = _
+  private var scheduler: TaskSchedulerImpl = _
+  private var conf: SparkConf = _
+
+  override def beforeEach(): Unit = {
+    conf = new SparkConf().setAppName("test").setMaster("local")
+      .set(config.BLACKLIST_ENABLED.key, "true")
+    scheduler = mockTaskSchedWithConf(conf)
+
+    clock.setTime(0)
+    blacklist = new BlacklistTracker(conf, clock)
+  }
+
+  override def afterEach(): Unit = {
+    if (blacklist != null) {
+      blacklist = null
+    }
+    if (scheduler != null) {
+      scheduler.stop()
+      scheduler = null
+    }
+    super.afterEach()
+  }
+
+  val allExecutorAndHostIds = (('A' to 'Z').map("host" + _) ++ (1 to 100).map{_.toString}).toSet
+
+  /**
+   * Its easier to write our tests as if we could directly look at the sets of nodes & executors in
+   * the blacklist. However the api doesn't expose a set, so this is a simple way to test
+   * something similar, since we know the universe of values that might appear in these sets.
+   */
+  def assertEquivalentToSet(f: String => Boolean, expected: Set[String]): Unit = {
+    allExecutorAndHostIds.foreach { id =>
+      val actual = f(id)
+      val exp = expected.contains(id)
+      assert(actual === exp, raw"""for string "$id" """)
+    }
+  }
+
+  def mockTaskSchedWithConf(conf: SparkConf): TaskSchedulerImpl = {
+    sc = new SparkContext(conf)
+    val scheduler = mock[TaskSchedulerImpl]
+    when(scheduler.sc).thenReturn(sc)
+    when(scheduler.mapOutputTracker).thenReturn(SparkEnv.get.mapOutputTracker)
+    scheduler
+  }
+
+  def createTaskSetBlacklist(stageId: Int = 0): TaskSetBlacklist = {
+    new TaskSetBlacklist(conf, stageId, clock)
+  }
+
+  test("executors can be blacklisted with only a few failures per stage") {
+    // For 4 different stages, executor 1 fails a task, then executor 2 succeeds the task,
+    // and then the task set is done. Not enough failures to blacklist the executor *within*
+    // any particular taskset, but we still blacklist the executor overall eventually.
+    // Also, we intentionally have a mix of task successes and failures -- there are even some
+    // successes after the executor is blacklisted. The idea here is those tasks get scheduled
+    // before the executor is blacklisted. We might get successes after blacklisting (because the
+    // executor might be flaky but not totally broken). But successes should not unblacklist the
+    // executor.
+    val failuresUntilBlacklisted = conf.get(config.MAX_FAILURES_PER_EXEC)
+    var failuresSoFar = 0
+    (0 until failuresUntilBlacklisted * 10).foreach { stageId =>
+      val taskSetBlacklist = createTaskSetBlacklist(stageId)
+      if (stageId % 2 == 0) {
+        // fail every other task
+        taskSetBlacklist.updateBlacklistForFailedTask("hostA", exec = "1", index = 0)
+        failuresSoFar += 1
+      }
+      blacklist.updateBlacklistForSuccessfulTaskSet(stageId, 0, taskSetBlacklist.execToFailures)
+      assert(failuresSoFar == stageId / 2 + 1)
+      if (failuresSoFar < failuresUntilBlacklisted) {
+        assertEquivalentToSet(blacklist.isExecutorBlacklisted(_), Set())
+      } else {
+        assertEquivalentToSet(blacklist.isExecutorBlacklisted(_), Set("1"))
+      }
+    }
+  }
+
+  // If an executor has many task failures, but the task set ends up failing, it shouldn't be
+  // counted against the executor.
+  test("executors aren't blacklisted if task sets fail") {

--- End diff --

"executors aren't blacklisted as a result of tasks in failed task sets"?
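The `assertEquivalentToSet` helper in the diff above compares a membership predicate against an expected set over a known universe of ids, because the blacklist API exposes only a membership test, not the set itself. The same trick in a stand-alone Java sketch (illustrative; this is not the Spark test code):

```java
import java.util.Set;
import java.util.function.Predicate;

public class PredicateVsSet {
    // The API exposes only a membership test, so check the predicate against the
    // expected set for every id in the universe of values that could appear.
    static void assertEquivalentToSet(Set<String> universe,
                                      Predicate<String> f, Set<String> expected) {
        for (String id : universe) {
            if (f.test(id) != expected.contains(id)) {
                throw new AssertionError("mismatch for \"" + id + "\"");
            }
        }
    }

    public static void main(String[] args) {
        Set<String> universe = Set.of("hostA", "hostB", "1", "2");
        Set<String> blacklisted = Set.of("1");
        // Passes: the predicate agrees with the expected set on every id.
        assertEquivalentToSet(universe, blacklisted::contains, Set.of("1"));
    }
}
```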
[GitHub] spark issue #16258: [SPARK-18834][SS] Expose event time stats through Stream...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16258 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70058/ Test PASSed.
[GitHub] spark issue #16258: [SPARK-18834][SS] Expose event time stats through Stream...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16258 Merged build finished. Test PASSed.
[GitHub] spark issue #16247: [SPARK-18817][SparkR] set default spark-warehouse path t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16247 A new property for the default warehouse: LGTM.
[GitHub] spark issue #16258: [SPARK-18834][SS] Expose event time stats through Stream...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16258 **[Test build #70058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70058/consoleFull)** for PR 16258 at commit [`b59ab80`](https://github.com/apache/spark/commit/b59ab8083de3f2441133fed35658dea39cd4a759).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15915: [SPARK-18485][CORE] Underlying integer overflow when cre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15915 **[Test build #70063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70063/consoleFull)** for PR 15915 at commit [`041992f`](https://github.com/apache/spark/commit/041992fa467d1475eb5b9ec7303b22f21ccf97ea).
[GitHub] spark pull request #15915: [SPARK-18485][CORE] Underlying integer overflow w...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/15915#discussion_r92103662

    --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBufferOutputStream.scala ---
    @@ -30,9 +31,14 @@ import org.apache.spark.storage.StorageUtils
      * @param chunkSize size of each chunk, in bytes.
      */
     private[spark] class ChunkedByteBufferOutputStream(
    -    chunkSize: Int,
    +    var chunkSize: Int,
         allocator: Int => ByteBuffer)
    -  extends OutputStream {
    +  extends OutputStream with Logging {
    +
    +  if (chunkSize < 0) {
    +    logWarning(s"chunkSize should not be a negative value, replaced by the 4MB default.")
    +    chunkSize = 4 * 1024 * 1024
    +  }
    --- End diff --

Since `chunkSize` is passed in from many code paths, and there is an underlying integer overflow when converting from `Long` to `Int`, introducing a protection check may be a reasonable tradeoff as a first step, given that we do not have a better solution. @JoshRosen @srowen
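The overflow discussed above happens because a 64-bit byte count cast to a JVM `Int` wraps around into a negative number. Below is a minimal Python sketch of that truncation and of the defensive fallback in the patch; the helper names are illustrative, not Spark's API.

```python
import ctypes

DEFAULT_CHUNK = 4 * 1024 * 1024  # the 4 MB fallback used in the patch above


def to_jvm_int(n: int) -> int:
    """Truncate an arbitrary integer to a signed 32-bit value,
    mimicking a JVM Long-to-Int cast."""
    return ctypes.c_int32(n & 0xFFFFFFFF).value


def safe_chunk_size(requested_bytes: int) -> int:
    """Mirror the guard in the diff: replace a negative (overflowed)
    chunk size with the default instead of failing later."""
    chunk = to_jvm_int(requested_bytes)
    return DEFAULT_CHUNK if chunk < 0 else chunk


# A 3 GB request exceeds Int.MaxValue (2^31 - 1), so the truncated value is negative:
print(to_jvm_int(3 * 1024 ** 3))       # -1073741824
# The guard falls back to the 4 MB default:
print(safe_chunk_size(3 * 1024 ** 3))  # 4194304
```

This only illustrates why a protection check is needed; the underlying `Long`-to-`Int` conversion in the caller remains the real bug to fix.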
[GitHub] spark issue #16263: [SPARK-18281][SQL][PySpark] Consumes the returned local ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16263 Merged build finished. Test FAILed.
[GitHub] spark issue #16263: [SPARK-18281][SQL][PySpark] Consumes the returned local ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16263 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70062/ Test FAILed.
[GitHub] spark issue #16262: [SPARK-17932][SQL][FOLLOWUP] Change statement `SHOW TABL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16262 Merged build finished. Test PASSed.
[GitHub] spark issue #16263: [SPARK-18281][SQL][PySpark] Consumes the returned local ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16263 **[Test build #70062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70062/consoleFull)** for PR 16263 at commit [`6905a70`](https://github.com/apache/spark/commit/6905a700376b2deff77ff539400951cf5e12885d).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16142: [SPARK-18716][CORE] Restrict the disk usage of spark eve...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16142 **[Test build #70059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70059/consoleFull)** for PR 16142 at commit [`c54c7f3`](https://github.com/apache/spark/commit/c54c7f3a1a021ab8d6d47990921d15f5e202b9ed).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16142: [SPARK-18716][CORE] Restrict the disk usage of spark eve...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16142 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70059/ Test FAILed.
[GitHub] spark issue #16262: [SPARK-17932][SQL][FOLLOWUP] Change statement `SHOW TABL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16262 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70056/ Test PASSed.
[GitHub] spark issue #16263: [SPARK-18281][SQL][PySpark] Consumes the returned local ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16263 **[Test build #70062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70062/consoleFull)** for PR 16263 at commit [`6905a70`](https://github.com/apache/spark/commit/6905a700376b2deff77ff539400951cf5e12885d).
[GitHub] spark issue #16142: [SPARK-18716][CORE] Restrict the disk usage of spark eve...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16142 Merged build finished. Test FAILed.
[GitHub] spark issue #16262: [SPARK-17932][SQL][FOLLOWUP] Change statement `SHOW TABL...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16262 **[Test build #70056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70056/consoleFull)** for PR 16262 at commit [`0c875c0`](https://github.com/apache/spark/commit/0c875c00eeb927a47b7d8eb47c9c0c8f27111d47).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #16263: [SPARK-18281][SQL][PySpark] Consumes the returned...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/16263 [SPARK-18281][SQL][PySpark] Consumes the returned local iterator immediately to prevent timeout on the socket serving the data

## What changes were proposed in this pull request?

There is a timeout failure when using `rdd.toLocalIterator()` or `df.toLocalIterator()` with a PySpark RDD or DataFrame:

    df = spark.createDataFrame([[1], [2], [3]])
    it = df.toLocalIterator()
    row = next(it)

The cause of this issue is that we open a socket on the JVM side to serve the data and set a timeout for that socket to accept a connection. If we don't consume the local iterator returned by `toLocalIterator` in Python immediately, the socket times out and the call fails.

## How was this patch tested?

Added tests to PySpark.

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 fix-pyspark-localiterator

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16263.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16263

commit 6905a700376b2deff77ff539400951cf5e12885d
Author: Liang-Chi Hsieh
Date: 2016-12-13T03:47:10Z
Consumes the returned local iterator immediately to prevent timeout on the socket serving the data.
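The idea behind the fix can be sketched without Spark: drain the remote iterator into a local buffer as soon as it is handed back, so the caller's pace no longer matters to the connection serving the data. This is a hedged, Spark-free sketch; the real patch does the equivalent inside PySpark's serializer layer, and `rows` here stands in for the socket stream.

```python
from typing import Iterable, Iterator, List


def eager_local_iterator(rows: Iterable[List[int]]) -> Iterator[List[int]]:
    # Consume the source immediately, before returning control to the caller.
    # With a lazy generator, a delayed next() would leave the serving socket
    # waiting past its accept timeout; buffering up front avoids that.
    buffered = list(rows)
    return iter(buffered)


it = eager_local_iterator([[1], [2], [3]])
print(next(it))  # [1]
```

The trade-off, of course, is that eager buffering materializes all rows on the consumer side, which works against the streaming intent of `toLocalIterator`; how much to buffer is exactly the kind of question this PR's review has to weigh.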
[GitHub] spark pull request #15717: [SPARK-17910][SQL] Allow users to update the comm...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15717#discussion_r92102030

    --- Diff: sql/core/src/test/resources/sql-tests/inputs/change-column.sql ---
    @@ -0,0 +1,53 @@
    +-- Create the origin table
    +CREATE TABLE test_change(a Int, b String, c Int);
    --- End diff --

nit: in SQL statements, we should upper-case the type strings, i.e. `STRING`, `INT`, etc.
[GitHub] spark pull request #15717: [SPARK-17910][SQL] Allow users to update the comm...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/15717#discussion_r92101718

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -274,6 +274,80 @@ case class AlterTableUnsetPropertiesCommand(
     }

    +/**
    + * A command to change the columns for a table; only changing the comments of non-partition
    + * columns is supported for now.
    + *
    + * The syntax of using this command in SQL is:
    + * {{{
    + *   ALTER TABLE table_identifier
    + *   CHANGE [COLUMN] column_old_name column_new_name column_dataType [COMMENT column_comment]
    + *   [FIRST | AFTER column_name];
    + * }}}
    + */
    +case class AlterTableChangeColumnsCommand(
    +    tableName: TableIdentifier,
    +    columns: Map[String, StructField]) extends RunnableCommand {
    +
    +  // TODO: support change column name/dataType/metadata/position.
    +  override def run(sparkSession: SparkSession): Seq[Row] = {
    +    val catalog = sparkSession.sessionState.catalog
    +    val table = catalog.getTableMetadata(tableName)
    +    val resolver = sparkSession.sessionState.conf.resolver
    +    DDLUtils.verifyAlterTableType(catalog, table, isView = false)
    +
    +    // Create a map that converts the origin column to the new column with changed comment;
    +    // throw an Exception if the column reference is invalid or the column name/dataType is
    +    // changed.
    +    val columnsMap = columns.map { case (oldName: String, newField: StructField) =>
    +      // Find the origin column from the schema by column name.
    +      val originColumn = findColumn(table.schema, oldName, resolver)
    +      // Throw an Exception if the column name/dataType is changed.
    +      if (!columnEqual(originColumn, newField, resolver)) {
    +        throw new AnalysisException(
    +          "ALTER TABLE CHANGE COLUMN is not supported for changing column " +
    +            s"'${getDesc(originColumn)}' to '${getDesc(newField)}'")
    +      }
    +      // Create a new column from the origin column with the new comment.
    +      val newColumn = addComment(originColumn, newField.getComment)
    +      // Create the map from origin column to changed column.
    +      originColumn -> newColumn
    --- End diff --

+1
[GitHub] spark pull request #15717: [SPARK-17910][SQL] Allow users to update the comm...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15717#discussion_r92101553

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    [...]
    +      // Create a new column from the origin column with the new comment.
    +      val newColumn = addComment(originColumn, newField.getComment)
    +      // Create the map from origin column to changed column.
    +      originColumn -> newColumn
    --- End diff --

How about using `originColumn.name` as the map key?
[GitHub] spark pull request #15717: [SPARK-17910][SQL] Allow users to update the comm...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15717#discussion_r92101519

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    [...]
    +      originColumn -> newColumn
    +    }
    +
    +    val newSchema = table.schema.fields.map(field => columnsMap.getOrElse(field, field))
    --- End diff --

ah i see
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996 **[Test build #70061 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70061/consoleFull)** for PR 15996 at commit [`172f6eb`](https://github.com/apache/spark/commit/172f6eb5eeb36819aaf731c547540c5af90c49cc).
[GitHub] spark issue #15995: [SPARK-18566][SQL] remove OverwriteOptions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15995 **[Test build #70060 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70060/consoleFull)** for PR 15995 at commit [`ed548e6`](https://github.com/apache/spark/commit/ed548e6437d64e67545f6bcb60384eb8badb4cec).
[GitHub] spark issue #16142: [SPARK-18716][CORE] Restrict the disk usage of spark eve...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16142 **[Test build #70059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70059/consoleFull)** for PR 16142 at commit [`c54c7f3`](https://github.com/apache/spark/commit/c54c7f3a1a021ab8d6d47990921d15f5e202b9ed).
[GitHub] spark issue #16142: [SPARK-18716][CORE] Restrict the disk usage of spark eve...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16142 @vanzin I have removed the related changes in `EventLoggingListener` and provided a new clean-up mode, i.e. a `space`-based mode. Please take a look.
[GitHub] spark pull request #15717: [SPARK-17910][SQL] Allow users to update the comm...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/15717#discussion_r92100070

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    [...]
    +      originColumn -> newColumn
    +    }
    +
    +    val newSchema = table.schema.fields.map(field => columnsMap.getOrElse(field, field))
    --- End diff --

If `newColumn` is empty, we should throw an exception like Hive does.
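The loop under review can be paraphrased in a few lines of plain Python, with dicts standing in for Spark's `StructField` (the names here are illustrative, not Spark's API). It also captures the two review points: keying the result map by column name, and rejecting anything other than a comment change.

```python
def change_column_comments(schema, changes):
    """Rebuild `schema` (a list of column dicts) with updated comments.

    `changes` maps an existing column name to its replacement field; only the
    comment may differ, mirroring the command's current restriction.
    """
    new_by_name = {}
    for old_name, new_field in changes.items():
        matches = [f for f in schema if f["name"] == old_name]
        if not matches:
            # Invalid column reference, analogous to the AnalysisException above.
            raise ValueError(f"Invalid column reference: {old_name}")
        origin = matches[0]
        if origin["name"] != new_field["name"] or origin["type"] != new_field["type"]:
            raise ValueError("only comment changes are supported")
        # Key by column name, as suggested in the review, rather than by the
        # field object itself.
        new_by_name[old_name] = {**origin, "comment": new_field.get("comment")}
    # Columns not mentioned in `changes` pass through unchanged.
    return [new_by_name.get(f["name"], f) for f in schema]


schema = [{"name": "a", "type": "int", "comment": None},
          {"name": "b", "type": "string", "comment": None}]
out = change_column_comments(
    schema, {"a": {"name": "a", "type": "int", "comment": "id"}})
print(out[0]["comment"])  # id
```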
[GitHub] spark pull request #15995: [SPARK-18566][SQL] remove OverwriteOptions
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15995#discussion_r92099484

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
    @@ -128,61 +128,69 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] {
         projectList
       }

    +  private def hasBeenPreprocessed(
    +      tableOutput: Seq[Attribute],
    +      partSchema: StructType,
    +      partSpec: Map[String, Option[String]],
    +      query: LogicalPlan): Boolean = {
    +    val partColNames = partSchema.map(_.name).toSet
    +    query.resolved && partSpec.keys.forall(partColNames.contains) && {
    +      val staticPartCols = partSpec.filter(_._2.isDefined).keySet
    +      val expectedColumns = tableOutput.filterNot(a => staticPartCols.contains(a.name))
    +      expectedColumns.toStructType.sameType(query.schema)
    --- End diff --

This is to follow the previous condition: https://github.com/apache/spark/pull/15995/files#diff-d99813bd5bbc18277e4090475e4944cfL166 The mismatch can be caused by users issuing an invalid command, e.g. `INSERT INTO src SELECT 1, 2` while table `src` has 3 columns.
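The check in the diff can be restated concretely. In this illustrative Python sketch, lists of column names stand in for Spark's `Attribute` and `StructType`, and arity stands in for the full `sameType` schema comparison: a query counts as preprocessed only if every partition-spec key names a real partition column and the query yields exactly the columns not pinned by static partition values.

```python
def has_been_preprocessed(table_cols, part_cols, part_spec, query_cols):
    # Every key in the partition spec must be an actual partition column.
    if not set(part_spec).issubset(set(part_cols)):
        return False
    # Columns fixed by a static value (spec maps them to a literal) are
    # excluded; the query must then supply the remaining columns. (The real
    # check compares full schemas with sameType; arity is the stand-in here.)
    static = {k for k, v in part_spec.items() if v is not None}
    expected = [c for c in table_cols if c not in static]
    return len(expected) == len(query_cols)


# The invalid command from the comment: `src` has 3 columns but the query
# supplies only 2, so the check (rightly) fails.
print(has_been_preprocessed(["a", "b", "c"], ["c"], {}, ["x", "y"]))        # False
# With partition column `c` pinned to a static value, 2 columns suffice.
print(has_been_preprocessed(["a", "b", "c"], ["c"], {"c": "1"}, ["x", "y"]))  # True
```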
[GitHub] spark pull request #15995: [SPARK-18566][SQL] remove OverwriteOptions
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15995#discussion_r92099139

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
    [...]
    +    val partColNames = partSchema.map(_.name).toSet
    +    query.resolved && partSpec.keys.forall(partColNames.contains) && {
    --- End diff --

yup
[GitHub] spark issue #16252: [SPARK-18827][Core] Fix cannot read broadcast on disk
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16252 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70055/ Test PASSed.
[GitHub] spark issue #16252: [SPARK-18827][Core] Fix cannot read broadcast on disk
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16252 Merged build finished. Test PASSed.
[GitHub] spark issue #16252: [SPARK-18827][Core] Fix cannot read broadcast on disk
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16252 **[Test build #70055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70055/consoleFull)** for PR 16252 at commit [`d964c54`](https://github.com/apache/spark/commit/d964c540974e680fc2809cd40a25ef72992a17db).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16194: [SPARK-18767][ML] Unify Models' toString methods
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16194 Merged build finished. Test PASSed.
[GitHub] spark issue #16194: [SPARK-18767][ML] Unify Models' toString methods
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16194 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70057/ Test PASSed.
[GitHub] spark issue #16194: [SPARK-18767][ML] Unify Models' toString methods
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16194 **[Test build #70057 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70057/consoleFull)** for PR 16194 at commit [`3b7ddfe`](https://github.com/apache/spark/commit/3b7ddfe02a00e7722b6ec41aef5ea1e69f738561).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70054/ Test PASSed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Merged build finished. Test PASSed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13599 **[Test build #70054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70054/consoleFull)** for PR 13599 at commit [`131ae1f`](https://github.com/apache/spark/commit/131ae1f66f753ae4de0d477aee029fe2103d270f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16258: [SPARK-18834][SS] Expose event time stats through Stream...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16258 **[Test build #70058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70058/consoleFull)** for PR 16258 at commit [`b59ab80`](https://github.com/apache/spark/commit/b59ab8083de3f2441133fed35658dea39cd4a759).
[GitHub] spark issue #16194: [SPARK-18767][ML] Unify Models' toString methods
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16194 **[Test build #70057 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70057/consoleFull)** for PR 16194 at commit [`3b7ddfe`](https://github.com/apache/spark/commit/3b7ddfe02a00e7722b6ec41aef5ea1e69f738561).