[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/6994#issuecomment-117808024 @josepablocam In the PR description, please also include a summary of this PR because it becomes part of the commit message. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user dragos commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33718904 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -124,10 +124,16 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkCon (blockId, getFile(blockId)) } + /** + * Create local directories for storing block data. These directories are + * located inside configured local directories and won't + * be deleted on JVM exit when using the external shuffle service. --- End diff -- Can you please read [my comment](https://github.com/apache/spark/pull/4984#issuecomment-117351436) again? I tried to explain to the best of my ability. If that's not clear, I'm happy to try again, but please ask more specific questions.
[GitHub] spark pull request: [SPARK-8378][Streaming]Add the Python API for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6830
[GitHub] spark pull request: [SPARK-3071] Increase default driver memory
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7132#issuecomment-117797061 [Test build #36282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36282/console) for PR 7132 at commit [`26cc177`](https://github.com/apache/spark/commit/26cc177d6ae76248273d84cd54c20839aed9e630). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7099#discussion_r33713764 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala --- @@ -31,7 +31,8 @@ import org.apache.spark.sql.DataFrame * @param predictionAndObservations an RDD of (prediction, observation) pairs. */ @Experimental -class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging { +class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) + extends Logging with Serializable { --- End diff -- Not too sure why but I was getting `Task Not Serializable` errors without this. I suspect this is because everything inside the `train()` method's closure gets serialized. I followed up on how lazy vals interact with serialization and found [this SO post](http://stackoverflow.com/questions/27882307/how-does-serialization-of-lazy-fields-work) which says that the value is serialized iff it was computed before serialization. In my updated implementation, one option could be to force the evaluation of `RegressionMetrics.summary` in the `LinearRegressionTestResults` constructor. However, despite being serializable I don't expect this class to be replicated anywhere except at the driver so maybe this eager evaluation is unnecessary. @jkbradley thoughts?
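The lazy-val behavior cited from the SO post above can be demonstrated in plain Scala, independent of Spark. The `Metrics` class and `Counter` object below are hypothetical demo helpers (not Spark code): if the lazy val was forced before Java serialization, the computed value travels with the object; otherwise the deserialized copy recomputes on first access.

```scala
import java.io._

// Hypothetical demo helpers, standing in for RegressionMetrics and its lazy summary.
object Counter { var n = 0 }

class Metrics extends Serializable {
  // Increment the counter so we can observe when the lazy body actually runs.
  lazy val summary: Double = { Counter.n += 1; 42.0 }
}

object LazySerDemo {
  // Java-serialization round trip, the same mechanism Spark uses for task closures.
  def roundTrip[T](obj: T): T = {
    val bos = new ByteArrayOutputStream()
    new ObjectOutputStream(bos).writeObject(obj)
    new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val cold = roundTrip(new Metrics)   // serialized before forcing the lazy val
    cold.summary                        // the copy must recompute on first access
    assert(Counter.n == 1)

    val warm = new Metrics
    warm.summary                        // force the lazy val, then serialize
    val copy = roundTrip(warm)
    copy.summary                        // computed value was serialized; no recomputation
    assert(Counter.n == 2)
  }
}
```

This is why forcing `summary` in the constructor would pin the value into the serialized form, at the cost of eager evaluation.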
[GitHub] spark pull request: [SPARK-4485][SQL] (1) Add broadcast hash outer...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7162#issuecomment-117799843 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33715021 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1241,8 +1253,37 @@ private[spark] class BlockManager( futureExecutionContext.shutdownNow() logInfo("BlockManager stopped") } -} + /** + * Contact all external shuffle services and de-register this application, cleaning + * shuffle files as well. + * + * This is required in Mesos mode with dynamic allocation, since executors leave behind + * shuffle files to be served by the external shuffle service. We need to delete those + * files when the application stops. + */ + private def cleanupAllShuffleFiles() { --- End diff -- This is only best-effort. If the driver is killed forcefully we still have shuffle files lingering around.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117800789 [Test build #36295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36295/consoleFull) for PR 7099 at commit [`9509c79`](https://github.com/apache/spark/commit/9509c799347738dc58d104ef5c4ae68f4c275c84).
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117800747 [Test build #36294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36294/consoleFull) for PR 7139 at commit [`0226768`](https://github.com/apache/spark/commit/0226768c1cddce134253b9d817098880f04de0a5).
[GitHub] spark pull request: [SPARK-7820][Build] Fix Java8-tests suite comp...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/7120#issuecomment-117801763 LGTM, so I'm going to merge this into master and branch-1.4. Thanks!
[GitHub] spark pull request: [SPARK-5562][MLlib] LDA should handle empty do...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7064#discussion_r33716204 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -119,6 +120,14 @@ final class EMLDAOptimizer extends LDAOptimizer { } } +// create term vertices for empty docs +val emptyDocTermVertices: RDD[(VertexId, TopicCounts)] = --- End diff -- rename: emptyDocVertices
[GitHub] spark pull request: [SPARK-3444] [core] Restore INFO level after l...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7140
[GitHub] spark pull request: [SPARK-746][CORE][WIP] Added Avro Serializatio...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7004#issuecomment-117805703 [Test build #36298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36298/consoleFull) for PR 7004 at commit [`d421bf5`](https://github.com/apache/spark/commit/d421bf5e54eb4678c6d21a44d100e84fe19d94ed).
[GitHub] spark pull request: [SPARK-8660][ML] Convert JavaDoc style comment...
Github user Rosstin commented on a diff in the pull request: https://github.com/apache/spark/pull/7096#discussion_r33719147 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -211,22 +211,22 @@ class LogisticRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { val trainer = (new LogisticRegression).setFitIntercept(true) val model = trainer.fit(binaryDataset) -/** - * Using the following R code to load the data and train the model using glmnet package. - * - * > library(glmnet) - * > data <- read.csv("path", header=FALSE) - * > label = factor(data$V1) - * > features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) - * > weights = coef(glmnet(features,label, family="binomial", alpha = 0, lambda = 0)) - * > weights - * 5 x 1 sparse Matrix of class "dgCMatrix" - * s0 - * (Intercept) 2.8366423 - * data.V2 -0.5895848 - * data.V3 0.8931147 - * data.V4 -0.3925051 - * data.V5 -0.7996864 +/* + Using the following R code to load the data and train the model using glmnet package. + + > library(glmnet) --- End diff -- Sure, no problem.
[GitHub] spark pull request: [SPARK-5016][MLLib] Distribute GMM mixture com...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7166#issuecomment-117817655 [Test build #36301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36301/consoleFull) for PR 7166 at commit [`1da3c7f`](https://github.com/apache/spark/commit/1da3c7f609f55d2be3c95d9265d256f6d2dc669f).
[GitHub] spark pull request: [SPARK-5016][MLLib] Distribute GMM mixture com...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7166#issuecomment-117817442 Merged build triggered.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7099#discussion_r33712588 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -193,11 +199,36 @@ class LinearRegression(override val uid: String) val intercept = if ($(fitIntercept)) yMean - dot(weights, Vectors.dense(featuresMean)) else 0.0 if (handlePersistence) instances.unpersist() +val summary = generateTrainingResults(instances, lossHistory.result(), weights, intercept) + // TODO: Converts to sparse format based on the storage, but may base on the scoring speed. -copyValues(new LinearRegressionModel(uid, weights.compressed, intercept)) +copyValues(new LinearRegressionModel(uid, weights.compressed, intercept, summary)) + } + + private def generateTrainingResults( --- End diff -- OK.
[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/6994#discussion_r33713902 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala --- @@ -158,4 +158,25 @@ object Statistics { def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = { ChiSqTest.chiSquaredFeatures(data) } + + /** + * Conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality + * @param data an `RDD[Double]` containing the sample of data to test + * @param cdf a `Double => Double` function to calculate the theoretical CDF at a given value + * @return KSTestResult object containing test statistic, p-value, and null hypothesis. + */ + def ksTest(data: RDD[Double], cdf: Double => Double): KSTestResult = { +KSTest.testOneSample(data, cdf) + } + + /** + * Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability + * distribution equality. Currently supports standard normal distribution only. + * @param data an `RDD[Double]` containing the sample of data to test + * @param name a `String` name for a theoretical distribution --- End diff -- `name` -> `dist` or `distName`? It is not clear what `name` means. You mentioned only standard normal distribution is supported but forgot to provide its corresponding distribution name in the doc. It is hard to guess `stdnorm` unless looking into the code.
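For context on what the `cdf: Double => Double` parameter feeds into: the one-sample, two-sided KS statistic is D = sup over x of |F_n(x) - F(x)|, the largest gap between the empirical CDF of the sample and the theoretical CDF. Below is a plain-Scala sketch of that statistic (not Spark's distributed `KSTest` implementation; `KSSketch` and its names are illustrative only):

```scala
object KSSketch {
  // One-sample, two-sided KS statistic: the empirical CDF jumps from i/n to
  // (i+1)/n at the i-th sorted point, so the supremum over all x is attained
  // at a sample point on one side of the jump or the other.
  def ksStatistic(sample: Array[Double], cdf: Double => Double): Double = {
    val sorted = sample.sorted
    val n = sorted.length.toDouble
    sorted.zipWithIndex.map { case (x, i) =>
      val fx = cdf(x)
      math.max(math.abs((i + 1) / n - fx), math.abs(i / n - fx))
    }.max
  }

  def main(args: Array[String]): Unit = {
    // Tiny sample checked against the Uniform(0,1) CDF.
    val uniformCdf = (x: Double) => math.max(0.0, math.min(1.0, x))
    val d = ksStatistic(Array(0.1, 0.4, 0.6, 0.9), uniformCdf)
    println(d) // each sample point sits 0.15 away from the nearer ECDF step
    assert(math.abs(d - 0.15) < 1e-12)
  }
}
```

A distributed version only changes how the supremum over sample points is aggregated; the per-point computation is the same.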
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117799304 Merged build triggered.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117799319 Merged build started.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714143 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1223,6 +1232,9 @@ private[spark] class BlockManager( } def stop(): Unit = { +if ((blockManagerId ne null) && blockManagerId.isDriver) { --- End diff -- elsewhere in Spark we just do `if (blockManagerId != null && blockManagerId.isDriver)`
[GitHub] spark pull request: [SPARK-4072][Core]Display Streaming blocks in ...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/6672#issuecomment-117799361 @JoshRosen Any more thoughts on this PR?
[GitHub] spark pull request: [SPARK-8660][ML] Convert JavaDoc style comment...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7096#discussion_r33714571 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -211,22 +211,22 @@ class LogisticRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { val trainer = (new LogisticRegression).setFitIntercept(true) val model = trainer.fit(binaryDataset) -/** - * Using the following R code to load the data and train the model using glmnet package. - * - * > library(glmnet) - * > data <- read.csv("path", header=FALSE) - * > label = factor(data$V1) - * > features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) - * > weights = coef(glmnet(features,label, family="binomial", alpha = 0, lambda = 0)) - * > weights - * 5 x 1 sparse Matrix of class "dgCMatrix" - * s0 - * (Intercept) 2.8366423 - * data.V2 -0.5895848 - * data.V3 0.8931147 - * data.V4 -0.3925051 - * data.V5 -0.7996864 +/* + Using the following R code to load the data and train the model using glmnet package. + + > library(glmnet) --- End diff -- @Rosstin Could you also remove `>`s? Then people can easily copy paste.
[GitHub] spark pull request: [SPARK-8660][ML] Convert JavaDoc style comment...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7096#discussion_r33714575 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -211,22 +211,22 @@ class LogisticRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { val trainer = (new LogisticRegression).setFitIntercept(true) val model = trainer.fit(binaryDataset) -/** - * Using the following R code to load the data and train the model using glmnet package. - * - * > library(glmnet) - * > data <- read.csv("path", header=FALSE) - * > label = factor(data$V1) - * > features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) - * > weights = coef(glmnet(features,label, family="binomial", alpha = 0, lambda = 0)) - * > weights - * 5 x 1 sparse Matrix of class "dgCMatrix" - * s0 - * (Intercept) 2.8366423 - * data.V2 -0.5895848 - * data.V3 0.8931147 - * data.V4 -0.3925051 - * data.V5 -0.7996864 +/* + Using the following R code to load the data and train the model using glmnet package. + + > library(glmnet) + > data <- read.csv("path", header=FALSE) + > label = factor(data$V1) + > features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) + > weights = coef(glmnet(features,label, family="binomial", alpha = 0, lambda = 0)) + > weights --- End diff -- If we remove `>`, please insert an empty line to separate commands and results.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714625 --- Diff: network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java --- @@ -138,6 +139,26 @@ public void registerWithShuffleServer( client.sendRpcSync(registerMessage, 5000 /* timeoutMs */); } + /** + * Removes this application from the external shuffle server and optionally deletes local + * files. + * + * @param host Host of the shuffle server. + * @param port Port of the shuffle server. + * @param cleanupLocalDirs True if corresponding shuffle files should be deleted + * @throws IOException + */ + public void applicationRemoved(String host, + int port, --- End diff -- This is really strange. When I have a callback called `applicationRemoved` I would expect to see an appID, not a host and a port. All of that should be transparent to the caller and handled internally right? The other thing is that there's already similar logic in `ExternalShuffleBlockResolver#applicationRemoved`.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117800161 Merged build triggered.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user dragos commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33719412 --- Diff: network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java --- @@ -138,6 +139,26 @@ public void registerWithShuffleServer( client.sendRpcSync(registerMessage, 5000 /* timeoutMs */); } + /** + * Removes this application from the external shuffle server and optionally deletes local + * files. + * + * @param host Host of the shuffle server. + * @param port Port of the shuffle server. + * @param cleanupLocalDirs True if corresponding shuffle files should be deleted + * @throws IOException + */ + public void applicationRemoved(String host, + int port, --- End diff -- Note that other (all?) methods in this class take a host and a port. This is not a callback. This is a client wrapper for connecting to the ExternalShuffleClient. They can be on any executor. How would you tell the client where to connect to?
[GitHub] spark pull request: [SPARK-5016] Distribute GMM mixture components...
GitHub user feynmanliang opened a pull request: https://github.com/apache/spark/pull/7166 [SPARK-5016] Distribute GMM mixture components to executors Distribute expensive portions of computation for Gaussian mixture components (in particular, pre-computation of `MultivariateGaussian.rootSigmaInv`, the inverse covariance matrix and covariance determinant) across executors. Repost of PR#4654. You can merge this pull request into a Git repository by running: $ git pull https://github.com/feynmanliang/spark GMM_parallel_mixtures Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7166.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7166 commit 1da3c7f609f55d2be3c95d9265d256f6d2dc669f Author: Feynman Liang fli...@databricks.com Date: 2015-06-30T22:58:58Z Distribute mixtures
[GitHub] spark pull request: [SPARK-8647][MLlib] Potential issue with const...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7146#issuecomment-117798423 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117798935 It wasn't clear to me how to use DFs in the result classes I was creating since they didn't have access to the model parameters (featuresCol, predictionCol, etc). I could add them as constructor params if you think that'll be better, but it's not clear to me what the benefit of using a DF in the result classes is since most use cases will only be interested in the summary functions rather than the predictions + labels themselves.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714016 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -111,12 +113,19 @@ private[spark] class BlockManager( // Client to read other executors' shuffle files. This is either an external service, or just the // standard BlockTransferService to directly connect to other Executors. - private[spark] val shuffleClient = if (externalShuffleServiceEnabled) { -val transConf = SparkTransportConf.fromSparkConf(conf, numUsableCores) -new ExternalShuffleClient(transConf, securityManager, securityManager.isAuthenticationEnabled(), - securityManager.isSaslEncryptionEnabled()) - } else { -blockTransferService + private[spark] val shuffleClient = mockShuffleClient.getOrElse(createShuffleClient) + + private def createShuffleClient: ShuffleClient = { --- End diff -- This doesn't need to be a method. It's probably fine to just do ``` private[spark] val shuffleClient = mockShuffleClient.getOrElse { if (externalShuffleServiceEnabled) ... } ```
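The reviewer's suggestion is the standard `Option.getOrElse { ... }` idiom: build the fallback inline instead of in a named method. A toy stand-in (the `Client` type and `createClient` helper are hypothetical, not Spark's actual `ShuffleClient`):

```scala
// Fall back to constructing the real client only when no mock was injected.
// Client and createClient are hypothetical stand-ins for illustration.
case class Client(kind: String)

def createClient(mock: Option[Client], externalShuffleServiceEnabled: Boolean): Client =
  mock.getOrElse {
    if (externalShuffleServiceEnabled) Client("external") else Client("blockTransfer")
  }

println(createClient(None, true).kind) // external
println(createClient(Some(Client("mock")), true).kind) // mock
```

Because `getOrElse` takes its default by name, the fallback block only runs when the `Option` is empty, so no real client is constructed when a mock is supplied.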
[GitHub] spark pull request: [ML][Minor] update transformSchema methods of ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/6433#issuecomment-117799019 @RoyGao Could you please create a JIRA and add it to this PR's title?
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714710 --- Diff: core/src/test/scala/org/apache/spark/scheduler/mesos/CoarseMesosSchedulerBackendSuite.scala --- @@ -0,0 +1,187 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler.mesos + +import java.util +import java.util.Collections + +import scala.collection.mutable + +import akka.actor.ActorSystem + +import com.typesafe.config.Config + +import org.apache.mesos.Protos.Value.Scalar +import org.apache.mesos.Protos._ +import org.apache.mesos.SchedulerDriver +import org.apache.mesos.MesosSchedulerDriver +import org.apache.spark.scheduler.TaskSchedulerImpl +import org.apache.spark.scheduler.cluster.mesos.{ CoarseMesosSchedulerBackend, MemoryUtils } +import org.apache.spark.{ LocalSparkContext, SparkConf, SparkEnv, SparkContext } + +import org.mockito.Matchers._ +import org.mockito.Mockito._ +import org.mockito.{ ArgumentCaptor, Matchers } + +import org.scalatest.FunSuite +import org.scalatest.mock.MockitoSugar +import org.scalatest.BeforeAndAfter + +class CoarseMesosSchedulerBackendSuite extends FunSuite +with LocalSparkContext +with MockitoSugar +with BeforeAndAfter { + + private def createOffer(offerId: String, slaveId: String, mem: Int, cpu: Int): Offer = { +val builder = Offer.newBuilder() +builder.addResourcesBuilder() + .setName("mem") + .setType(Value.Type.SCALAR) + .setScalar(Scalar.newBuilder().setValue(mem)) +builder.addResourcesBuilder() + .setName("cpus") + .setType(Value.Type.SCALAR) + .setScalar(Scalar.newBuilder().setValue(cpu)) +builder.setId(OfferID.newBuilder() + .setValue(offerId).build()) + .setFrameworkId(FrameworkID.newBuilder() +.setValue("f1")) + .setSlaveId(SlaveID.newBuilder().setValue(slaveId)) + .setHostname(s"host${slaveId}") + .build() + } + + private def createSchedulerBackend(taskScheduler: TaskSchedulerImpl, +driver: SchedulerDriver): CoarseMesosSchedulerBackend = { --- End diff -- style: ``` private def createSchedulerBackend( taskScheduler: TaskSchedulerImpl, driver: SchedulerDriver): CoarseMesosSchedulerBackend = { ... } ```
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117800154 Merged build triggered.
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117800176 Merged build started.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117800175 Merged build started.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714772 --- Diff: core/src/test/scala/org/apache/spark/scheduler/mesos/CoarseMesosSchedulerBackendSuite.scala --- @@ -0,0 +1,187 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler.mesos + +import java.util +import java.util.Collections + +import scala.collection.mutable + +import akka.actor.ActorSystem + +import com.typesafe.config.Config + +import org.apache.mesos.Protos.Value.Scalar +import org.apache.mesos.Protos._ +import org.apache.mesos.SchedulerDriver +import org.apache.mesos.MesosSchedulerDriver +import org.apache.spark.scheduler.TaskSchedulerImpl +import org.apache.spark.scheduler.cluster.mesos.{ CoarseMesosSchedulerBackend, MemoryUtils } +import org.apache.spark.{ LocalSparkContext, SparkConf, SparkEnv, SparkContext } + +import org.mockito.Matchers._ +import org.mockito.Mockito._ +import org.mockito.{ ArgumentCaptor, Matchers } --- End diff -- style: no space before/after `{}`. Also, all of these third-party imports should just be grouped together. See other files.
[GitHub] spark pull request: [SPARK-746][CORE][WIP] Added Avro Serializatio...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7004#issuecomment-117805393 Merged build triggered.
[GitHub] spark pull request: [SPARK-746][CORE][WIP] Added Avro Serializatio...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7004#issuecomment-117805408 Merged build started.
[GitHub] spark pull request: [SPARK-746][CORE][WIP] Added Avro Serializatio...
Github user JDrit commented on a diff in the pull request: https://github.com/apache/spark/pull/7004#discussion_r33716906 --- Diff: core/pom.xml --- @@ -398,6 +398,40 @@ <artifactId>py4j</artifactId> <version>0.8.2.1</version> </dependency> + <dependency> + <groupId>org.apache.avro</groupId> + <artifactId>avro</artifactId> + <version>${avro.version}</version> + <scope>${hadoop.deps.scope}</scope> + </dependency> + <dependency> + <groupId>org.apache.avro</groupId> + <artifactId>avro-mapred</artifactId> + <version>${avro.version}</version> --- End diff -- Thanks for pointing that out, just fixed that.
[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/6994#discussion_r33717906 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala --- @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.stat.test + +import org.apache.commons.math3.distribution.{NormalDistribution, RealDistribution} +import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest + +import org.apache.spark.rdd.RDD + +/** + * Conduct the two-sided Kolmogorov-Smirnov test for data sampled from a + * continuous distribution. By comparing the largest difference between the empirical cumulative + * distribution of the sample data and the theoretical distribution we can provide a test for the + * null hypothesis that the sample data comes from that theoretical distribution. + * For more information on the KS test: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test + * + * Implementation note: We seek to implement the KS test with a minimal number of distributed + * passes. We sort the RDD, and then perform the following operations on a per-partition basis: + * calculate an empirical cumulative distribution value for each observation, and a theoretical + * cumulative distribution value. We know the latter to be correct, while the former will be off by + * a constant (how large the constant is depends on how many values precede it in other partitions). + * However, given that this constant simply shifts the ECDF upwards, but doesn't change its shape, + * and furthermore, that constant is the same within a given partition, we can pick 2 values + * in each partition that can potentially resolve to the largest global distance. Namely, we + * pick the minimum distance and the maximum distance. Additionally, we keep track of how many + * elements are in each partition. Once these three values have been returned for every partition, + * we can collect and operate locally. Locally, we can now adjust each distance by the appropriate + * constant (the cumulative sum of # of elements in the prior partitions divided by the data set + * size). Finally, we take the maximum absolute value, and this is the statistic. + */ +private[stat] object KSTest { + + // Null hypothesis for the type of KS test to be included in the result. + object NullHypothesis extends Enumeration { +type NullHypothesis = Value +val oneSampleTwoSided = Value("Sample follows theoretical distribution.") + } + + /** + * Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution. + * @param data `RDD[Double]` data on which to run test + * @param cdf `Double => Double` function to calculate the theoretical CDF + * @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis) + */ + def testOneSample(data: RDD[Double], cdf: Double => Double): KSTestResult = { +val n = data.count().toDouble +val localData = data.sortBy(x => x).mapPartitions { part => + val partDiffs = oneSampleDifferences(part, n, cdf) // local distances + searchOneSampleCandidates(partDiffs) // candidates: local extrema +}.collect() +val ksStat = searchOneSampleStatistic(localData, n) // result: global extreme +evalOneSampleP(ksStat, n.toLong) + } + + /** + * Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution. + * @param data `RDD[Double]` data on which to run test + * @param createDist `() => RealDistribution` function to create a theoretical distribution + * @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis) + */ + def testOneSample(data: RDD[Double], createDist: () => RealDistribution): KSTestResult = { +val n = data.count().toDouble +val localData = data.sortBy(x => x).mapPartitions { part => + val
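The per-partition scheme described in the implementation note can be checked locally (a hedged sketch: plain collections stand in for RDD partitions, and `ksStat`/`ksStatPartitioned` are hypothetical names, not the PR's functions). Each partition reports its element count plus its minimum and maximum unadjusted ECDF-CDF differences; the driver then shifts each pair by the cumulative count of prior partitions over n before taking the global maximum absolute value:

```scala
// Global one-pass KS statistic for sorted data (the reference computation).
def ksStat(sorted: Vector[Double], cdf: Double => Double): Double = {
  val n = sorted.length.toDouble
  sorted.zipWithIndex.map { case (x, i) =>
    math.max((i + 1) / n - cdf(x), cdf(x) - i / n)
  }.max
}

// Partitioned version mimicking the distributed strategy: per partition keep
// (min diff, max diff, count); the local ECDF is off by a constant equal to
// (# elements in prior partitions) / n, applied afterwards on the "driver".
def ksStatPartitioned(parts: Seq[Vector[Double]], cdf: Double => Double): Double = {
  val n = parts.map(_.length).sum.toDouble
  val local = parts.map { part =>
    val diffs = part.zipWithIndex.flatMap { case (x, j) =>
      Seq((j + 1) / n - cdf(x), j / n - cdf(x)) // ECDF just after / before x
    }
    (diffs.min, diffs.max, part.length)
  }
  var prior = 0
  val candidates = local.flatMap { case (mn, mx, cnt) =>
    val shift = prior / n
    prior += cnt
    Seq(math.abs(mn + shift), math.abs(mx + shift))
  }
  candidates.max
}

val data = Vector(0.1, 0.2, 0.35, 0.5, 0.8, 0.9) // already sorted
val parts = Seq(data.take(3), data.drop(3))
println(ksStat(data, x => x))             // uniform CDF on [0, 1]
println(ksStatPartitioned(parts, x => x)) // same value
```

The shift argument works because adding the constant to a partition's minimum and maximum diffs preserves which elements can attain the global extreme, so only two candidates per partition ever need to leave the executors.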
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117813319 It wasn't clear to me how to use DFs in the result classes I was creating since they didn't have access to the model parameters (featuresCol, predictionCol, etc). I could add them as constructor params if you think that'll be better but it's not clear to me what the benefit of using a DF in the result classes is since most use cases will only be interested in the summary functions rather than the predictions + labels themselves. Sorry, this was ambiguous. The plan is to have DFs for each result type, not necessarily ones zipped with the transformed data. Later on, we could provide extra output columns to include the values in the transformed data, but we won't just yet. E.g., we can provide a DataFrame storing only 1 column of residuals.
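The "single column of residuals" idea can be sketched locally (hedged: plain Scala collections standing in for a one-column DataFrame; `LabeledPrediction` and `residuals` are hypothetical names):

```scala
// One "column" of residuals derived from (label, prediction) pairs, without
// zipping it back onto the transformed data. In Spark ML this would be a
// single-column DataFrame; here a Seq stands in for it.
case class LabeledPrediction(label: Double, prediction: Double)

def residuals(rows: Seq[LabeledPrediction]): Seq[Double] =
  rows.map(r => r.label - r.prediction)

val rows = Seq(LabeledPrediction(1.0, 0.75), LabeledPrediction(2.0, 2.5))
println(residuals(rows)) // List(0.25, -0.5)
```

The point of the design is that the result type owns its own derived column, so it never needs the model's `featuresCol`/`predictionCol` parameters.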
[GitHub] spark pull request: [SPARK-8479] [MLlib] Add numNonzeros and numAc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6904#issuecomment-117815373 [Test build #36288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36288/console) for PR 6904 at commit [`252c6b7`](https://github.com/apache/spark/commit/252c6b72426300fa0859c9d83c0b014b5f94bf6e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117816111 [Test build #36299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36299/consoleFull) for PR 7165 at commit [`867754a`](https://github.com/apache/spark/commit/867754aa2b852e971c6d4359d67a469b1f709611).
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
GitHub user Rosstin opened a pull request: https://github.com/apache/spark/pull/7167 [SPARK-8660] [MLLib] removed symbols from comments in LogisticRegressionSuite.scala for ease of copypaste '' symbols removed from comments in LogisticRegressionSuite.scala, for ease of copypaste also single-lined the multiline commands (is this desirable, or does it violate style?) You can merge this pull request into a Git repository by running: $ git pull https://github.com/Rosstin/spark SPARK-8660-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7167.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7167 commit 6c18058336a1a027207d194c31134351ec3ed86d Author: Rosstin astera...@gmail.com Date: 2015-06-26T18:00:35Z fixed minor typos in docs/README.md and docs/api.md commit 21ac1e54283d633e5c4978427e03937b17c1b626 Author: Rosstin astera...@gmail.com Date: 2015-06-29T17:06:15Z Merge branch 'master' of github.com:apache/spark into SPARK-8639 commit 2cd298520f7018fbd2fc5174f40a984db52b68a0 Author: Rosstin astera...@gmail.com Date: 2015-06-29T20:06:15Z Merge branch 'master' of github.com:apache/spark into SPARK-8639 commit 242aeddcd949f23c86c5b3000e27019a901df64b Author: Rosstin astera...@gmail.com Date: 2015-06-29T20:18:39Z SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala commit bb9a4b19487c55c826d0a49b1987dba9b4d7d031 Author: Rosstin astera...@gmail.com Date: 2015-06-29T20:21:17Z Merge branch 'master' of github.com:apache/spark into SPARK-8660 commit 5a05dee9fb142e8997b85904d01a02964ed32553 Author: Rosstin astera...@gmail.com Date: 2015-06-29T20:27:44Z SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code. 
commit 39ddd50ee27d80debf02cf9a5985c8bf2f4cb94c Author: Rosstin astera...@gmail.com Date: 2015-07-01T20:13:37Z Merge branch 'master' of github.com:apache/spark into SPARK-8661 commit fe6b11224126adb5692292fbb8b8c1bc03c48f46 Author: Rosstin astera...@gmail.com Date: 2015-07-01T20:40:41Z SPARK-8660 symbols removed from LogisticRegressionSuite.scala for easy of copypaste
[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33722239 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import scala.collection.mutable + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +/** + * :: Experimental :: + * Converts a text document to a sparse vector of token counts. + * @param vocabulary An Array over terms. Only the terms in the vocabulary will be counted. + */ +@Experimental +class CountVectorizer (override val uid: String, vocabulary: Array[String]) extends HashingTF { + + def this(vocabulary: Array[String]) = this(Identifiable.randomUID("countVectorizer"), vocabulary) --- End diff -- This is probably fine for now, but I had some thoughts about having an empty constructor for including every word encountered if no vocabulary is provided. If it requires significant modification, we should make a separate JIRA for it.
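The reviewer's idea — derive a vocabulary from the corpus when none is supplied, then count against it — can be sketched in plain Scala (hedged: `fitVocabulary` and `countVector` are hypothetical helpers, not the CountVectorizer API):

```scala
// Hypothetical sketch of the review comment: with no vocabulary given, build
// one from every token encountered, then produce a count vector per document.
def fitVocabulary(docs: Seq[Seq[String]]): Array[String] =
  docs.flatten.distinct.sorted.toArray

def countVector(doc: Seq[String], vocab: Array[String]): Array[Int] = {
  val index = vocab.zipWithIndex.toMap
  val counts = Array.fill(vocab.length)(0)
  doc.foreach(t => index.get(t).foreach(i => counts(i) += 1)) // unknown tokens ignored
  counts
}

val docs = Seq(Seq("a", "b"), Seq("b", "c", "b"))
val vocab = fitVocabulary(docs)
println(vocab.toList)                      // List(a, b, c)
println(countVector(docs(1), vocab).toList) // List(0, 2, 1)
```

Fitting the vocabulary is an extra pass over the corpus, which is presumably why it would warrant a separate JIRA rather than a tweak to the constructor.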
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
Github user Rosstin commented on the pull request: https://github.com/apache/spark/pull/7167#issuecomment-117820353 @mengxr Would it be desirable to un-multiline the LOC in the file's comments? Or should these remain multiline to follow style? (What I mean is, the lines are long enough that they were being broken into multiple lines, so copy-pasting them would be harder. I made them back into single-line.)
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33723836 --- Diff: core/src/test/scala/org/apache/spark/scheduler/ExecutorBlacklistTrackerSuite.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable + +import org.scalatest.{BeforeAndAfter, PrivateMethodTester} + +import org.apache.spark._ +import org.apache.spark.scheduler.ExecutorBlacklistTracker.ExecutorFailureStatus +import org.apache.spark.scheduler.cluster.ExecutorInfo +import org.apache.spark.util.ManualClock + +class ExecutorBlacklistTrackerSuite + extends SparkFunSuite + with LocalSparkContext + with BeforeAndAfter { + import ExecutorBlacklistTrackerSuite._ + + before { +if (sc == null) { + sc = createSparkContext +} + } + + after { +if (sc != null) { + sc.stop() + sc = null +} + } + + test("add executor to blacklist") { +// Add 5 executors +addExecutors(5) +val tracker = sc.executorBlacklistTracker.get +assert(numExecutorsRegistered(tracker) === 5) + +// Post 5 TaskEnd events to executor-1 to add executor-1 to the blacklist +(0 until 5).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-1")) + +assert(tracker.getExecutorBlacklist === Set("executor-1")) +assert(executorIdToTaskFailures(tracker)("executor-1").numFailures === 5) +assert(executorIdToTaskFailures(tracker)("executor-1").isBlackListed === true) + +// Post 10 TaskEnd events to executor-2 to add executor-2 to the blacklist +(0 until 10).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-2")) +assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2")) +assert(executorIdToTaskFailures(tracker)("executor-2").numFailures === 10) +assert(executorIdToTaskFailures(tracker)("executor-2").isBlackListed === true) + +// Post 5 TaskEnd events to executor-3 to verify whether executor-3 is blacklisted +(0 until 5).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-3")) +// Since the failure number of executor-3 is less than the average blacklist threshold, +// though it exceeds the fault threshold, it should not be added to the blacklist +assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2")) +assert(executorIdToTaskFailures(tracker)("executor-3").numFailures === 5) +assert(executorIdToTaskFailures(tracker)("executor-3").isBlackListed === false) + +// Keep posting TaskEnd events to executor-3 to add executor-3 to the blacklist +(0 until 2).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-3")) +assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2", "executor-3")) +assert(executorIdToTaskFailures(tracker)("executor-3").numFailures === 7) +assert(executorIdToTaskFailures(tracker)("executor-3").isBlackListed === true) + +// Post TaskEnd events to executor-4 to verify whether executor-4 can be added to the blacklist +(0 until 10).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-4")) +// Even though executor-4's failed task count is above the blacklist threshold, +// the number of blacklisted executors has reached the maximum fraction, +// so executor-4 still cannot be added to the blacklist. --- End diff -- just thinking aloud -- I wonder if this is really the best we can do. If there are too many executors blacklisted, does it make more sense to just kill the job or the spark context? I suppose that if you really keep having a lot of failures past this point, eventually you'll trigger the 4 task failures on one node which will kill the job, so maybe this is reasonable?
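The capping behavior being questioned can be sketched with a toy tracker (hedged: `ToyBlacklistTracker` is hypothetical and far simpler than the PR's `ExecutorBlacklistTracker`). An executor is blacklisted only if its failure count reaches a fault threshold, exceeds the fleet average, and blacklisting it would not push the blacklisted fraction past a maximum:

```scala
import scala.collection.mutable

// Hypothetical toy tracker illustrating the three conditions under discussion:
// a per-executor fault threshold, above-average failures, and a cap on the
// fraction of executors that may be blacklisted at once.
class ToyBlacklistTracker(faultThreshold: Int, maxBlacklistFraction: Double) {
  private case class Status(var numFailures: Int = 0, var isBlackListed: Boolean = false)
  private val executors = mutable.LinkedHashMap.empty[String, Status]

  def registerExecutor(id: String): Unit = {
    executors.getOrElseUpdate(id, Status())
  }

  def onTaskFailure(id: String): Unit = {
    val s = executors.getOrElseUpdate(id, Status())
    s.numFailures += 1
    val avg = executors.values.map(_.numFailures).sum.toDouble / executors.size
    val blacklisted = executors.values.count(_.isBlackListed)
    if (!s.isBlackListed &&
        s.numFailures >= faultThreshold &&
        s.numFailures > avg &&
        (blacklisted + 1).toDouble / executors.size <= maxBlacklistFraction) {
      s.isBlackListed = true
    }
  }

  def getExecutorBlacklist: Set[String] =
    executors.collect { case (id, s) if s.isBlackListed => id }.toSet
}

val tracker = new ToyBlacklistTracker(faultThreshold = 5, maxBlacklistFraction = 0.5)
(1 to 5).foreach(i => tracker.registerExecutor(s"executor-$i"))
(1 to 5).foreach(_ => tracker.onTaskFailure("executor-1")) // blacklisted: 5 >= 5, above avg
(1 to 5).foreach(_ => tracker.onTaskFailure("executor-2")) // blacklisted: fraction 2/5 <= 0.5
(1 to 7).foreach(_ => tracker.onTaskFailure("executor-3")) // capped out: 3/5 > 0.5, stays off
println(tracker.getExecutorBlacklist) // executor-1 and executor-2 only
```

The cap embodies the trade-off squito raises: past the fraction limit the tracker stops blacklisting and lets the normal per-task failure limits kill the job instead.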
[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt
Github user squito commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-117826371 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/7167#issuecomment-117826411 For copy-pasting in Scala, :paste mode makes multi-line copy/paste work well (although it requires remembering that, plus ctrl-D)
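A minimal illustration of the :paste flow holdenk describes (transcript abbreviated; the exact banner text varies by Scala version):

```
scala> :paste
// Entering paste mode (ctrl-D to finish)

val xs = List(1, 2, 3)
xs.map(_ * 2)

// press ctrl-D here
// Exiting paste mode, now interpreting.
```

Without :paste, the REPL interprets each pasted line eagerly, which breaks multi-line definitions such as a class body split across lines.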
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7093#discussion_r33725936 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala --- @@ -253,7 +253,9 @@ class PairUDF extends GenericUDF { ) override def evaluate(args: Array[DeferredObject]): AnyRef = { +// scalastyle:off println println("Type = %s".format(args(0).getClass.getName)) --- End diff -- This is a big change. I still think a number of these printlns are debug leftovers and can be removed, or in some cases turned into logging. I don't know of a better way to review these than to just sift through them a few times. I think anything in a main() method or in close support of a CLI utility can stay; examples too are probably OK with println. Obviously there are some methods whose job it is to print to the console directly. Everything else, I'm not sure where you would generally allow println. So, this is one I think can be removed?
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7093#discussion_r33725957 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala --- @@ -87,7 +87,9 @@ class InsertIntoHiveTableSuite extends QueryTest with BeforeAndAfter { sql("CREATE TABLE doubleCreateAndInsertTest (key int, value string)") }.getMessage +// scalastyle:off println --- End diff -- Remove?
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-117828706 [Test build #36303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36303/consoleFull) for PR 5423 at commit [`e7897ae`](https://github.com/apache/spark/commit/e7897ae1493a7fef4f51e41ca72c4e98d028eda7).
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7093#discussion_r33726438 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -684,7 +684,9 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { val buffer = new CircularBuffer(25) val stream = new java.io.PrintStream(buffer, true, "UTF-8") +// scalastyle:off println --- End diff -- Same, false positive, just wondering if this really fails
[GitHub] spark pull request: [SPARK-5095][MESOS] Support capping cores and ...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/4027#issuecomment-117832235 @tnachen can you address the comments to simplify this patch using `spark.executor.cores` instead? As discussed above this is an optional setting and reuses the same setting across cluster managers. It's also much simpler to reason about than the two proposed configs in this patch.
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117832863 Merged build triggered.
[GitHub] spark pull request: [SPARK-5095][MESOS] Support capping cores and ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4027#discussion_r33728044 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala --- @@ -109,6 +123,13 @@ private[spark] class CoarseMesosSchedulerBackend( } } + protected def driverUrl: String = AkkaUtils.address( + AkkaUtils.protocol(sc.env.actorSystem), + SparkEnv.driverActorSystemName, + conf.get("spark.driver.host"), + conf.get("spark.driver.port"), + CoarseGrainedSchedulerBackend.ACTOR_NAME) --- End diff -- I suggested this in #4984, but this could just do a check to see if we're in a test:

```
private val driverUrl: String = {
  val testing = conf.get("spark.testing", "false").toBoolean
  if (testing) {
    "stub" // Mock this class without connecting to a non-existent driver in tests
  } else {
    AkkaUtils.address(...)
  }
}
```
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117832900 Merged build started.
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117835854 Merged build started.
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117835834 Merged build triggered.
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117837584 [Test build #36306 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36306/console) for PR 7165 at commit [`3b09d31`](https://github.com/apache/spark/commit/3b09d3146d2cf867d50500b727aeace5f7a8ac16). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33729664 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorBlacklistTracker.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.util.concurrent.TimeUnit + +import scala.collection.mutable + +import org.apache.spark._ +import org.apache.spark.util.{Clock, SystemClock, ThreadUtils, Utils} + +/** + * ExecutorBlacklistTracker blacklists executors by tracking the status of running tasks with a + * heuristic algorithm. + * + * An executor will be considered bad enough only when: + * 1. The number of failed tasks on this executor is more than + * spark.scheduler.blacklist.executorFaultThreshold. + * 2. The number of failed tasks on this executor is + * spark.scheduler.blacklist.averageBlacklistThreshold more than the average number of failed + * tasks of this cluster.
+ * + * Also the max number of blacklisted executors will not exceed the + * spark.scheduler.blacklist.maxBlacklistFraction of the whole cluster, and blacklisted executors + * will be forgiven when there are no failed tasks within + * spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes. + */ +private[spark] class ExecutorBlacklistTracker(conf: SparkConf) extends SparkListener { + import ExecutorBlacklistTracker._ + + private val maxBlacklistFraction = conf.getDouble( + "spark.scheduler.blacklist.maxBlacklistFraction", MAX_BLACKLIST_FRACTION) + private val avgBlacklistThreshold = conf.getDouble( + "spark.scheduler.blacklist.averageBlacklistThreshold", AVERAGE_BLACKLIST_THRESHOLD) + private val executorFaultThreshold = conf.getInt( + "spark.scheduler.blacklist.executorFaultThreshold", EXECUTOR_FAULT_THRESHOLD) + private val executorFaultTimeoutWindowInMinutes = conf.getInt( + "spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes", EXECUTOR_FAULT_TIMEOUT_WINDOW) + + // Count the number of executors registered + var numExecutorsRegistered: Int = 0 + + // Track the number of failed tasks and the time of the latest failure per executor id + val executorIdToTaskFailures = new mutable.HashMap[String, ExecutorFailureStatus]() + + // Clock used to update and exclude the executors which are out of the time window.
+ private var clock: Clock = new SystemClock() + + // Executor that handles the scheduling task + private val executor = ThreadUtils.newDaemonSingleThreadScheduledExecutor( + "spark-scheduler-blacklist-expire-timer") + + def start(): Unit = { + val scheduleTask = new Runnable() { + override def run(): Unit = { + Utils.logUncaughtExceptions(expireTimeoutExecutorBlacklist()) + } + } + executor.scheduleAtFixedRate(scheduleTask, 0L, 60, TimeUnit.SECONDS) + } + + def stop(): Unit = { + executor.shutdown() + executor.awaitTermination(10, TimeUnit.SECONDS) + } + + def setClock(newClock: Clock): Unit = { + clock = newClock + } + + def getExecutorBlacklist: Set[String] = synchronized { + executorIdToTaskFailures.filter(_._2.isBlackListed).keys.toSet --- End diff -- this gets called a lot, and (hopefully) is updated only rarely. It should probably be computed only when it changes, and then stored. (Also I think the stored value could probably just be `@volatile`, so you wouldn't need to synchronize ... but I'm not 100% sure ...)
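A minimal sketch of the caching approach squito suggests above (class and method names hypothetical, not the PR's actual code): recompute the blacklisted set only when a failure status mutates, store the snapshot in a `@volatile` field, and let the hot read path do a plain volatile read with no lock.

```scala
import scala.collection.mutable

class BlacklistCache {
  // executorId -> isBlackListed; mutated only under the lock below.
  private val failureStatus = new mutable.HashMap[String, Boolean]()

  // Immutable snapshot; readers see it with a volatile read, no synchronization.
  @volatile private var cachedBlacklist: Set[String] = Set.empty

  // All mutations go through here; recompute the snapshot while holding the lock.
  def update(executorId: String, isBlackListed: Boolean): Unit = synchronized {
    failureStatus(executorId) = isBlackListed
    cachedBlacklist = failureStatus.filter(_._2).keys.toSet
  }

  // Called on every scheduling decision; now just a field read.
  def getExecutorBlacklist: Set[String] = cachedBlacklist
}
```

Because `Set[String]` is immutable, publishing it through a `@volatile` var is safe: a reader either sees the old snapshot or the new one, never a half-updated collection.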
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7167#issuecomment-117820196 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-8746][SQL] update download link for Hiv...
Github user ckadner commented on the pull request: https://github.com/apache/spark/pull/7144#issuecomment-117820180 @JoshRosen -- Hi Josh, thanks for kicking off the tests. Can you help me make sense of the test results? I only changed a markdown file, so no test cases should be impacted, and no interface methods were touched that could throw off the MiMa tests. ``` [error] running /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/mima ; received return code 255 Archiving unit tests logs... No log files found. Attempting to post to Github... Post successful. Archiving artifacts WARN: No artifacts found that match the file pattern **/target/unit-tests.log. Configuration error? WARN: java.lang.InterruptedException: no matches found within 1 Recording test results ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error? Finished: FAILURE ```
[GitHub] spark pull request: [SPARK-8695] [core] [WIP] TreeAggregation shou...
GitHub user piganesh opened a pull request: https://github.com/apache/spark/pull/7168 [SPARK-8695] [core] [WIP] TreeAggregation shouldn't be triggered for 5 partitions You can merge this pull request into a Git repository by running: $ git pull https://github.com/ibmsoe/spark SPARK-8695 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7168.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7168 commit a6fed07f65c3c71e19f231c54fa43fc4904b040d Author: Perinkulam I. Ganesh g...@us.ibm.com Date: 2015-07-01T20:50:28Z [SPARK-8695] [core] [WIP] TreeAggregation shouldn't be triggered for 5 partitions
[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-117823839 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-117823675 **[Test build #36289 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36289/console)** for PR 6648 at commit [`55a9bb1`](https://github.com/apache/spark/commit/55a9bb1d8617424e059e7de052ecaca154e4ab44) after a configured wait of `175m`.
[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-117827909 [Test build #36302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36302/consoleFull) for PR 6648 at commit [`55a9bb1`](https://github.com/apache/spark/commit/55a9bb1d8617424e059e7de052ecaca154e4ab44).
[GitHub] spark pull request: [SPARK-1564] [Docs] Added Javascript to Javado...
GitHub user deroneriksson opened a pull request: https://github.com/apache/spark/pull/7169 [SPARK-1564] [Docs] Added Javascript to Javadocs to create badges for tags like :: Experimental :: Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags You can merge this pull request into a Git repository by running: $ git pull https://github.com/deroneriksson/spark SPARK-1564_JavaDocs_badges Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7169.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7169 commit 65b493072eb385ef25151a6485369f801730c760 Author: Deron Eriksson de...@us.ibm.com Date: 2015-07-01T21:08:43Z Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33726149 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -124,10 +124,16 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkCon (blockId, getFile(blockId)) } + /** + * Create local directories for storing block data. These directories are + * located inside configured local directories and won't + * be deleted on JVM exit when using the external shuffle service. --- End diff -- I just read your comment again. I still don't see how the directory layout is related to cleaning up shuffle files. The reason why we don't clean up shuffle files in Mesos (and standalone mode) is simply because the shuffle service doesn't know when an application exits. When shuffle service is enabled, [executors no longer clean up the shuffle files on exit](https://github.com/apache/spark/blob/1ce6428907b4ddcf52dbf0c86196d82ab7392442/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L162), so no one cleans these files up anymore. All we need to do then is to add this missing code path. Since the external shuffle service already [knows](https://github.com/apache/spark/blob/1ce6428907b4ddcf52dbf0c86196d82ab7392442/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L147) about the `localDirs` on each executor, it can just go ahead and delete these directories (which contain the shuffle files written). Could you explain why the directory structure needs to change? Why is it not sufficient to just remove the shuffle directories?
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/7093#issuecomment-117829855 Heroic effort here. It certainly is big but is fixing up a lot of little issues of this form. I flagged a few more questions about it but it is looking pretty good. Is anyone else concerned about the cost of future merge conflicts on this one vs the benefit?
[GitHub] spark pull request: [SPARK-1564] [Docs] Added Javascript to Javado...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7169#issuecomment-117831150 Merged build started.
[GitHub] spark pull request: [SPARK-1564] [Docs] Added Javascript to Javado...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7169#issuecomment-117831126 Merged build triggered.
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117832007 [Test build #36294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36294/console) for PR 7139 at commit [`0226768`](https://github.com/apache/spark/commit/0226768c1cddce134253b9d817098880f04de0a5). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117832118 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6910] [WiP] Reduce number of operations...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7049#discussion_r33727646 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -379,31 +379,25 @@ abstract class HadoopFsRelation private[sql](maybePartitionSpec: Option[Partitio var leafDirToChildrenFiles = mutable.Map.empty[Path, Array[FileStatus]] def refresh(): Unit = { - // We don't filter files/directories whose names start with "_" except "_temporary" here, as - // specific data sources may take advantage of them (e.g. Parquet _metadata and - // _common_metadata files). "_temporary" directories are explicitly ignored since failed - // tasks/jobs may leave partial/corrupted data files there. - def listLeafFilesAndDirs(fs: FileSystem, status: FileStatus): Set[FileStatus] = { - if (status.getPath.getName.toLowerCase == "_temporary") { - Set.empty - } else { - val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDir) - val leafDirs = if (dirs.isEmpty) Set(status) else Set.empty[FileStatus] - files.toSet ++ leafDirs ++ dirs.flatMap(dir => listLeafFilesAndDirs(fs, dir)) - } - } leafFiles.clear() - val statuses = paths.flatMap { path => + val statuses = paths.par.flatMap { path => val hdfsPath = new Path(path) val fs = hdfsPath.getFileSystem(hadoopConf) val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory) - Try(fs.getFileStatus(qualified)).toOption.toArray.flatMap(listLeafFilesAndDirs(fs, _)) + val it = fs.listFiles(qualified, true) --- End diff -- Unfortunately this API doesn't exist in Hadoop 1...
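Since `FileSystem.listFiles(path, recursive)` only appeared in Hadoop 2, a version-neutral alternative has to recurse manually over `listStatus`, which exists in both Hadoop 1 and 2. A hedged sketch (the helper name is hypothetical; `FileStatus.isDir` is the Hadoop-1-era accessor, later deprecated in favor of `isDirectory`):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Recursively collect leaf (non-directory) files under `root` using only
// calls available in both Hadoop 1 and Hadoop 2.
def listLeafFiles(fs: FileSystem, root: Path): Seq[FileStatus] = {
  val (dirs, files) = fs.listStatus(root).partition(_.isDir)
  files.toSeq ++ dirs.flatMap(d => listLeafFiles(fs, d.getPath))
}
```

Note this issues one `listStatus` RPC per directory, whereas the Hadoop 2 `listFiles(path, true)` iterator can batch locatedstatus responses; that performance gap is part of why the PR wanted the newer API.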
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117833346 [Test build #36305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36305/consoleFull) for PR 7131 at commit [`7104e97`](https://github.com/apache/spark/commit/7104e97f01f47c6405bc8f8e51b5485ddca27efe).
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117834553 [Test build #36306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36306/consoleFull) for PR 7165 at commit [`3b09d31`](https://github.com/apache/spark/commit/3b09d3146d2cf867d50500b727aeace5f7a8ac16).
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33729199 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorBlacklistTracker.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.util.concurrent.TimeUnit + +import scala.collection.mutable + +import org.apache.spark._ +import org.apache.spark.util.{Clock, SystemClock, ThreadUtils, Utils} + +/** + * ExecutorBlacklistTracker blacklists the executors by tracking the status of running tasks with + * heuristic algorithm. + * + * A executor will be considered bad enough only when: + * 1. The failure task number on this executor is more than + * spark.scheduler.blacklist.executorFaultThreshold. + * 2. The failure task number on this executor is + * spark.scheduler.blacklist.averageBlacklistThreshold more than average failure task number + * of this cluster. 
 + * + * Also max number of blacklisted executors will not exceed the + * spark.scheduler.blacklist.maxBlacklistFraction of whole cluster, and blacklisted executors + * will be forgiven when there is no failure tasks in the + * spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes. + */ +private[spark] class ExecutorBlacklistTracker(conf: SparkConf) extends SparkListener { + import ExecutorBlacklistTracker._ + + private val maxBlacklistFraction = conf.getDouble( + "spark.scheduler.blacklist.maxBlacklistFraction", MAX_BLACKLIST_FRACTION) + private val avgBlacklistThreshold = conf.getDouble( + "spark.scheduler.blacklist.averageBlacklistThreshold", AVERAGE_BLACKLIST_THRESHOLD) + private val executorFaultThreshold = conf.getInt( + "spark.scheduler.blacklist.executorFaultThreshold", EXECUTOR_FAULT_THRESHOLD) + private val executorFaultTimeoutWindowInMinutes = conf.getInt( + "spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes", EXECUTOR_FAULT_TIMEOUT_WINDOW) --- End diff -- these new confs need to be documented (along with `spark.scheduler.blacklist.enabled`)
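The blacklisting heuristic described in the quoted doc comment can be condensed into a small executable sketch. The concrete threshold values and the exact way the relative margin combines with the absolute threshold are illustrative assumptions on my part, not the PR's actual defaults:

```scala
// Sketch of the two-condition blacklist heuristic plus the cluster-wide cap.
// All values below are illustrative assumptions.
val executorFaultThreshold = 5    // absolute failure-count threshold
val avgBlacklistThreshold = 0.5   // relative margin over the cluster average
val maxBlacklistFraction = 0.5    // cap on the fraction of blacklisted executors

// Observed task failures per executor.
val failures = Map("exec-1" -> 9, "exec-2" -> 6, "exec-3" -> 1, "exec-4" -> 0)
val numExecutors = failures.size
val avgFailures = failures.values.sum.toDouble / numExecutors

// An executor is "bad enough" only when BOTH conditions hold:
// 1. its failure count exceeds the absolute threshold, and
// 2. its failure count exceeds the cluster average by the relative margin.
def isBad(numFailures: Int): Boolean =
  numFailures > executorFaultThreshold &&
    numFailures > avgFailures * (1 + avgBlacklistThreshold)

// The blacklist never exceeds maxBlacklistFraction of the cluster;
// the worst offenders are taken first.
val maxBlacklisted = (numExecutors * maxBlacklistFraction).toInt
val blacklisted = failures.filter { case (_, n) => isBad(n) }
  .toSeq.sortBy(-_._2).take(maxBlacklisted).map(_._1).toSet
// blacklisted == Set("exec-1"): exec-2's 6 failures clear the absolute
// threshold but not the margin over the average of 4.0.
```

Whether `averageBlacklistThreshold` is a relative or an absolute margin is not fully determined by the quoted excerpt; the relative reading above is one plausible interpretation.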
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117835313 @JoshRosen @shaneknapp -- So in this case the SparkR unit tests failed, but the AmplabJenkins message says "Test Passed"? Do you know what could cause this?
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7167#issuecomment-117837038 Let's keep the line width within 100. As @holdenk mentioned, we can copy-paste a paragraph of code into Scala and IPython easily. I also tried RStudio, which accepts multiline statements as well.
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117836993 [Test build #36307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36307/consoleFull) for PR 7131 at commit [`7d73168`](https://github.com/apache/spark/commit/7d73168de9b45cb15229f9fe2e7e97304aa8c375).
[GitHub] spark pull request: [SPARK-3071] Increase default driver memory
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7132#issuecomment-117818124 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-3071] Increase default driver memory
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7132#issuecomment-117818081 [Test build #36291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36291/console) for PR 7132 at commit [`fd67721`](https://github.com/apache/spark/commit/fd67721c3d63b5aaca092c72c424bdeb726440c8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user jonalter commented on the pull request: https://github.com/apache/spark/pull/7093#issuecomment-117820014 @srowen - I have addressed the point you made regarding using log rather than println in some cases. Please let me know what you think. Thank you.
[GitHub] spark pull request: [SPARK-8746][SQL] update download link for Hiv...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/7144#issuecomment-117821666 This is a rare build flakiness issue that we haven't diagnosed. I kicked off tests just to test out a pull request builder script change. This change looks fine to me, so I think we should merge this into master and 1.4.
[GitHub] spark pull request: [SPARK-8695] [core] [WIP] TreeAggregation shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7168#issuecomment-117823182 Can one of the admins verify this patch?
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117825520 @brkyvz Thanks for sending out this PR. It's looking good. I had a couple of high-level points: 1. It might be good to point out in some of the error messages how the JAR should be structured for this to work. I know that it works out of the box with the SBT plugin, but it would be good to explain this for users who aren't using the plugin. 2. Does this also do the package install on all the executors? It's not important right now with the DataFrame API, but some of the work we do in the future will run R code on the executors.
[GitHub] spark pull request: [SPARK-5016][MLLib] Distribute GMM mixture com...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7166#issuecomment-117825624 [Test build #36301 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36301/console) for PR 7166 at commit [`1da3c7f`](https://github.com/apache/spark/commit/1da3c7f609f55d2be3c95d9265d256f6d2dc669f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7093#discussion_r33726211 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/NumericParserSuite.scala --- @@ -33,7 +33,9 @@ class NumericParserSuite extends SparkFunSuite { malformatted.foreach { s => intercept[SparkException] { NumericParser.parse(s) + // scalastyle:off println println(s"Didn't detect malformatted string $s.") --- End diff -- Throw an exception even?
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-117828238 Merged build started.
[GitHub] spark pull request: [SPARK-5095][MESOS] Support capping cores and ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4027#discussion_r33728151 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala --- @@ -211,35 +226,43 @@ private[spark] class CoarseMesosSchedulerBackend( for (offer <- offers) { val slaveId = offer.getSlaveId.toString -val mem = getResource(offer.getResourcesList, "mem") -val cpus = getResource(offer.getResourcesList, "cpus").toInt -if (totalCoresAcquired < maxCores && -mem >= MemoryUtils.calculateTotalMemory(sc) && -cpus >= 1 +var remainingMem = getResource(offer.getResourcesList, "mem") +var remainingCores = getResource(offer.getResourcesList, "cpus").toInt + --- End diff -- can you nix these blank lines? L233 too
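For context, the diff above renames `mem`/`cpus` to `remainingMem`/`remainingCores` so an offer can be drawn down while a cluster-wide core cap is respected. A toy sketch of that greedy accounting follows; `Offer` and every constant here are simplified stand-ins for the Mesos protobuf types and SparkConf settings, not the PR's actual code:

```scala
// Toy model of capped core acquisition across Mesos resource offers.
case class Offer(slaveId: String, mem: Double, cpus: Int)

val maxCores = 8              // assumed cluster-wide cap on total cores
val memPerExecutor = 1024.0   // assumed memory needed per executor

var totalCoresAcquired = 0
val offers = Seq(Offer("s1", 4096.0, 6), Offer("s2", 4096.0, 6))

// For each offer, take as many cores as the cap still allows, provided the
// offer also has enough memory for an executor.
val launches = offers.flatMap { offer =>
  val coresToTake = math.min(offer.cpus, maxCores - totalCoresAcquired)
  if (coresToTake > 0 && offer.mem >= memPerExecutor) {
    totalCoresAcquired += coresToTake
    Some((offer.slaveId, coresToTake))
  } else {
    None
  }
}
// With the numbers above: 6 cores from s1, then only 2 from s2 (cap of 8).
```

The real backend additionally tracks per-slave failures and decrements the remaining resources as it launches multiple executors per offer; this sketch only shows the capping idea.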
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117833781 Merged build started.
[GitHub] spark pull request: [SPARK-1564] [Docs] Added Javascript to Javado...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7169#issuecomment-117833756 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33728907 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorBlacklistTracker.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.util.concurrent.TimeUnit + +import scala.collection.mutable + +import org.apache.spark._ +import org.apache.spark.util.{Clock, SystemClock, ThreadUtils, Utils} + +/** + * ExecutorBlacklistTracker blacklists the executors by tracking the status of running tasks with + * heuristic algorithm. + * + * A executor will be considered bad enough only when: + * 1. The failure task number on this executor is more than + * spark.scheduler.blacklist.executorFaultThreshold. + * 2. The failure task number on this executor is + * spark.scheduler.blacklist.averageBlacklistThreshold more than average failure task number + * of this cluster. 
 + * + * Also max number of blacklisted executors will not exceed the + * spark.scheduler.blacklist.maxBlacklistFraction of whole cluster, and blacklisted executors + * will be forgiven when there is no failure tasks in the + * spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes. + */ +private[spark] class ExecutorBlacklistTracker(conf: SparkConf) extends SparkListener { + import ExecutorBlacklistTracker._ + + private val maxBlacklistFraction = conf.getDouble( + "spark.scheduler.blacklist.maxBlacklistFraction", MAX_BLACKLIST_FRACTION) + private val avgBlacklistThreshold = conf.getDouble( + "spark.scheduler.blacklist.averageBlacklistThreshold", AVERAGE_BLACKLIST_THRESHOLD) + private val executorFaultThreshold = conf.getInt( + "spark.scheduler.blacklist.executorFaultThreshold", EXECUTOR_FAULT_THRESHOLD) + private val executorFaultTimeoutWindowInMinutes = conf.getInt( + "spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes", EXECUTOR_FAULT_TIMEOUT_WINDOW) + + // Count the number of executors registered + var numExecutorsRegistered: Int = 0 + + // Track the number of failure tasks and time of latest failure to executor id + val executorIdToTaskFailures = new mutable.HashMap[String, ExecutorFailureStatus]() + + // Clock used to update and exclude the executors which are out of time window. 
 + private var clock: Clock = new SystemClock() + + // Executor that handles the scheduling task + private val executor = ThreadUtils.newDaemonSingleThreadScheduledExecutor( + "spark-scheduler-blacklist-expire-timer") + + def start(): Unit = { + val scheduleTask = new Runnable() { + override def run(): Unit = { + Utils.logUncaughtExceptions(expireTimeoutExecutorBlacklist()) + } + } + executor.scheduleAtFixedRate(scheduleTask, 0L, 60, TimeUnit.SECONDS) + } + + def stop(): Unit = { + executor.shutdown() + executor.awaitTermination(10, TimeUnit.SECONDS) + } + + def setClock(newClock: Clock): Unit = { + clock = newClock + } + + def getExecutorBlacklist: Set[String] = synchronized { + executorIdToTaskFailures.filter(_._2.isBlackListed).keys.toSet + } + + override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized { + taskEnd.reason match { + case _: FetchFailed | _: ExceptionFailure | TaskResultLost | + _: ExecutorLostFailure | UnknownReason => + val failureStatus = executorIdToTaskFailures.getOrElseUpdate(taskEnd.taskInfo.executorId, + new ExecutorFailureStatus) + failureStatus.numFailures += 1 + failureStatus.updatedTime = clock.getTimeMillis() + + // Update the executor blacklist + updateExecutorBlacklist() + case _ => Unit + } + } + + override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit =
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33729777 --- Diff: core/src/test/scala/org/apache/spark/scheduler/ExecutorBlacklistTrackerSuite.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
 + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable + +import org.scalatest.{BeforeAndAfter, PrivateMethodTester} + +import org.apache.spark._ +import org.apache.spark.scheduler.ExecutorBlacklistTracker.ExecutorFailureStatus +import org.apache.spark.scheduler.cluster.ExecutorInfo +import org.apache.spark.util.ManualClock + +class ExecutorBlacklistTrackerSuite + extends SparkFunSuite + with LocalSparkContext + with BeforeAndAfter { + import ExecutorBlacklistTrackerSuite._ + + before { + if (sc == null) { + sc = createSparkContext + } + } + + after { + if (sc != null) { + sc.stop() + sc = null + } + } + + test("add executor to blacklist") { + // Add 5 executors + addExecutors(5) + val tracker = sc.executorBlacklistTracker.get + assert(numExecutorsRegistered(tracker) === 5) + + // Post 5 TaskEnd events to executor-1 to add executor-1 into the blacklist + (0 until 5).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-1")) + + assert(tracker.getExecutorBlacklist === Set("executor-1")) + assert(executorIdToTaskFailures(tracker)("executor-1").numFailures === 5) + assert(executorIdToTaskFailures(tracker)("executor-1").isBlackListed === true) + + // Post 10 TaskEnd events to executor-2 to add executor-2 into the blacklist + (0 until 10).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-2")) + assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2")) + assert(executorIdToTaskFailures(tracker)("executor-2").numFailures === 10) + assert(executorIdToTaskFailures(tracker)("executor-2").isBlackListed === true) + + // Post 5 TaskEnd events to executor-3 to verify whether executor-3 is blacklisted + (0 until 5).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-3")) + // Since the failure number of executor-3 is less than the average blacklist threshold, + // though it exceeds the fault threshold, it still should not be added into the blacklist + assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2")) + assert(executorIdToTaskFailures(tracker)("executor-3").numFailures === 5) 
 + assert(executorIdToTaskFailures(tracker)("executor-3").isBlackListed === false) + + // Keep posting TaskEnd events to executor-3 to add executor-3 into the blacklist + (0 until 2).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-3")) + assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2", "executor-3")) + assert(executorIdToTaskFailures(tracker)("executor-3").numFailures === 7) + assert(executorIdToTaskFailures(tracker)("executor-3").isBlackListed === true) + + // Post TaskEnd events to executor-4 to verify whether executor-4 could be added into the blacklist + (0 until 10).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-4")) + // Even though executor-4's failure task number is above the blacklist threshold, + // the blacklisted executor number has reached the maximum fraction, + // so executor-4 still cannot be added into the blacklist. + assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2", "executor-3")) + assert(executorIdToTaskFailures(tracker)("executor-4").numFailures === 10) + assert(executorIdToTaskFailures(tracker)("executor-4").isBlackListed === false) + } + + test("remove executor from blacklist") { + // Add 5 executors + addExecutors(5) + val tracker = sc.executorBlacklistTracker.get + val clock = new ManualClock(1L) + tracker.setClock(clock) + assert(numExecutorsRegistered(tracker) === 5) +
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117837626 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117818193 [Test build #36299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36299/console) for PR 7165 at commit [`867754a`](https://github.com/apache/spark/commit/867754aa2b852e971c6d4359d67a469b1f709611). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117818218 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33723361 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -124,10 +124,16 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkCon (blockId, getFile(blockId)) } + /** + * Create local directories for storing block data. These directories are + * located inside configured local directories and won't + * be deleted on JVM exit when using the external shuffle service. --- End diff -- I just read it again. You mentioned that on Mesos the shuffle dir is not deleted, but its parent directory is. I'm confused about two things: First, if the parent directory is deleted, wouldn't everything it automatically be deleted as well? Second, I actually don't see how the parent directory (the middle layer in your comment) is deleted. Are you referring to `DiskBlockManager#doStop()`? AFAIK that only deletes the temporary directory created inside of the parent directory (i.e. the shuffle dir), and we don't do this if external shuffle service. I can't find the code that deletes the middle layer itself. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org