[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/6994#issuecomment-117808024 @josepablocam In the PR description, please also include a summary of this PR because it becomes part of the commit message. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user dragos commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33718904 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -124,10 +124,16 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkCon (blockId, getFile(blockId)) } + /** + * Create local directories for storing block data. These directories are + * located inside configured local directories and won't + * be deleted on JVM exit when using the external shuffle service. --- End diff -- Can you please read [my comment](https://github.com/apache/spark/pull/4984#issuecomment-117351436) again? I tried to explain to the best of my ability. If that's not clear, I'm happy to try again, but please ask more specific questions.
[GitHub] spark pull request: [SPARK-8378][Streaming]Add the Python API for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6830
[GitHub] spark pull request: [SPARK-3071] Increase default driver memory
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7132#issuecomment-117797061 [Test build #36282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36282/console) for PR 7132 at commit [`26cc177`](https://github.com/apache/spark/commit/26cc177d6ae76248273d84cd54c20839aed9e630). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7099#discussion_r33713764 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala --- @@ -31,7 +31,8 @@ import org.apache.spark.sql.DataFrame * @param predictionAndObservations an RDD of (prediction, observation) pairs. */ @Experimental -class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging { +class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) + extends Logging with Serializable { --- End diff -- Not too sure why but I was getting `Task Not Serializable` errors without this. I suspect this is because everything inside the `train()` method's closure gets serialized. I followed up on how lazy vals interact with serialization and found [this SO post](http://stackoverflow.com/questions/27882307/how-does-serialization-of-lazy-fields-work) which says that the value is serialized iff it was computed before serialization. In my updated implementation, one option could be to force the evaluation of `RegressionMetrics.summary` in the `LinearRegressionTestResults` constructor. However, despite being serializable I don't expect this class to be replicated anywhere except at the driver so maybe this eager evaluation is unnecessary. @jkbradley thoughts?
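The lazy-val behavior cited from the SO post above can be demonstrated in plain Scala, independent of Spark. The `Metrics` class and `Counter` object below are hypothetical demo helpers (not Spark code): if the lazy val was forced before Java serialization, the computed value travels with the object; otherwise the deserialized copy recomputes on first access.

```scala
import java.io._

// Hypothetical demo helpers, standing in for RegressionMetrics and its lazy summary.
object Counter { var n = 0 }

class Metrics extends Serializable {
  // Increment the counter so we can observe when the lazy body actually runs.
  lazy val summary: Double = { Counter.n += 1; 42.0 }
}

object LazySerDemo {
  // Java-serialization round trip, the same mechanism Spark uses for task closures.
  def roundTrip[T](obj: T): T = {
    val bos = new ByteArrayOutputStream()
    new ObjectOutputStream(bos).writeObject(obj)
    new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val cold = roundTrip(new Metrics)   // serialized before forcing the lazy val
    cold.summary                        // the copy must recompute on first access
    assert(Counter.n == 1)

    val warm = new Metrics
    warm.summary                        // force the lazy val, then serialize
    val copy = roundTrip(warm)
    copy.summary                        // computed value was serialized; no recomputation
    assert(Counter.n == 2)
  }
}
```

This is why forcing `summary` in the constructor would pin the value into the serialized form, at the cost of eager evaluation.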
[GitHub] spark pull request: [SPARK-4485][SQL] (1) Add broadcast hash outer...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7162#issuecomment-117799843 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33715021 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1241,8 +1253,37 @@ private[spark] class BlockManager( futureExecutionContext.shutdownNow() logInfo("BlockManager stopped") } -} + /** + * Contact all external shuffle services and de-register this application, cleaning + * shuffle files as well. + * + * This is required in Mesos mode with dynamic allocation, since executors leave behind + * shuffle files to be served by the external shuffle service. We need to delete those + * files when the application stops. + */ + private def cleanupAllShuffleFiles() { --- End diff -- This is only best-effort. If the driver is killed forcefully we still have shuffle files lingering around.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117800789 [Test build #36295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36295/consoleFull) for PR 7099 at commit [`9509c79`](https://github.com/apache/spark/commit/9509c799347738dc58d104ef5c4ae68f4c275c84).
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117800747 [Test build #36294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36294/consoleFull) for PR 7139 at commit [`0226768`](https://github.com/apache/spark/commit/0226768c1cddce134253b9d817098880f04de0a5).
[GitHub] spark pull request: [SPARK-7820][Build] Fix Java8-tests suite comp...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/7120#issuecomment-117801763 LGTM, so I'm going to merge this into master and branch-1.4. Thanks!
[GitHub] spark pull request: [SPARK-5562][MLlib] LDA should handle empty do...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7064#discussion_r33716204 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -119,6 +120,14 @@ final class EMLDAOptimizer extends LDAOptimizer { } } +// create term vertices for empty docs +val emptyDocTermVertices: RDD[(VertexId, TopicCounts)] = --- End diff -- rename: emptyDocVertices
[GitHub] spark pull request: [SPARK-3444] [core] Restore INFO level after l...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7140
[GitHub] spark pull request: [SPARK-746][CORE][WIP] Added Avro Serializatio...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7004#issuecomment-117805703 [Test build #36298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36298/consoleFull) for PR 7004 at commit [`d421bf5`](https://github.com/apache/spark/commit/d421bf5e54eb4678c6d21a44d100e84fe19d94ed).
[GitHub] spark pull request: [SPARK-8660][ML] Convert JavaDoc style comment...
Github user Rosstin commented on a diff in the pull request: https://github.com/apache/spark/pull/7096#discussion_r33719147 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -211,22 +211,22 @@ class LogisticRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { val trainer = (new LogisticRegression).setFitIntercept(true) val model = trainer.fit(binaryDataset) -/** - * Using the following R code to load the data and train the model using glmnet package. - * - * > library(glmnet) - * > data <- read.csv("path", header=FALSE) - * > label = factor(data$V1) - * > features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) - * > weights = coef(glmnet(features,label, family="binomial", alpha = 0, lambda = 0)) - * > weights - * 5 x 1 sparse Matrix of class "dgCMatrix" - * s0 - * (Intercept) 2.8366423 - * data.V2 -0.5895848 - * data.V3 0.8931147 - * data.V4 -0.3925051 - * data.V5 -0.7996864 +/* + Using the following R code to load the data and train the model using glmnet package. + + > library(glmnet) --- End diff -- Sure, no problem.
[GitHub] spark pull request: [SPARK-5016][MLLib] Distribute GMM mixture com...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7166#issuecomment-117817655 [Test build #36301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36301/consoleFull) for PR 7166 at commit [`1da3c7f`](https://github.com/apache/spark/commit/1da3c7f609f55d2be3c95d9265d256f6d2dc669f).
[GitHub] spark pull request: [SPARK-5016][MLLib] Distribute GMM mixture com...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7166#issuecomment-117817442 Merged build triggered.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7099#discussion_r33712588 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -193,11 +199,36 @@ class LinearRegression(override val uid: String) val intercept = if ($(fitIntercept)) yMean - dot(weights, Vectors.dense(featuresMean)) else 0.0 if (handlePersistence) instances.unpersist() +val summary = generateTrainingResults(instances, lossHistory.result(), weights, intercept) + // TODO: Converts to sparse format based on the storage, but may base on the scoring speed. -copyValues(new LinearRegressionModel(uid, weights.compressed, intercept)) +copyValues(new LinearRegressionModel(uid, weights.compressed, intercept, summary)) + } + + private def generateTrainingResults( --- End diff -- OK.
[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/6994#discussion_r33713902 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala --- @@ -158,4 +158,25 @@ object Statistics { def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = { ChiSqTest.chiSquaredFeatures(data) } + + /** + * Conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality + * @param data an `RDD[Double]` containing the sample of data to test + * @param cdf a `Double => Double` function to calculate the theoretical CDF at a given value + * @return KSTestResult object containing test statistic, p-value, and null hypothesis. + */ + def ksTest(data: RDD[Double], cdf: Double => Double): KSTestResult = { +KSTest.testOneSample(data, cdf) + } + + /** + * Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability + * distribution equality. Currently supports standard normal distribution only. + * @param data an `RDD[Double]` containing the sample of data to test + * @param name a `String` name for a theoretical distribution --- End diff -- `name` -> `dist` or `distName`? It is not clear what `name` means. You mentioned only standard normal distribution is supported but forgot to provide its corresponding distribution name in the doc. It is hard to guess `stdnorm` unless looking into the code.
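For context on what the `cdf: Double => Double` parameter feeds into: the one-sample, two-sided KS statistic is D = sup over x of |F_n(x) - F(x)|, the largest gap between the empirical CDF of the sample and the theoretical CDF. Below is a plain-Scala sketch of that statistic (not Spark's distributed `KSTest` implementation; `KSSketch` and its names are illustrative only):

```scala
object KSSketch {
  // One-sample, two-sided KS statistic: the empirical CDF jumps from i/n to
  // (i+1)/n at the i-th sorted point, so the supremum over all x is attained
  // at a sample point on one side of the jump or the other.
  def ksStatistic(sample: Array[Double], cdf: Double => Double): Double = {
    val sorted = sample.sorted
    val n = sorted.length.toDouble
    sorted.zipWithIndex.map { case (x, i) =>
      val fx = cdf(x)
      math.max(math.abs((i + 1) / n - fx), math.abs(i / n - fx))
    }.max
  }

  def main(args: Array[String]): Unit = {
    // Tiny sample checked against the Uniform(0,1) CDF.
    val uniformCdf = (x: Double) => math.max(0.0, math.min(1.0, x))
    val d = ksStatistic(Array(0.1, 0.4, 0.6, 0.9), uniformCdf)
    println(d) // each sample point sits 0.15 away from the nearer ECDF step
    assert(math.abs(d - 0.15) < 1e-12)
  }
}
```

A distributed version only changes how the supremum over sample points is aggregated; the per-point computation is the same.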
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117799304 Merged build triggered.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117799319 Merged build started.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714143 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1223,6 +1232,9 @@ private[spark] class BlockManager( } def stop(): Unit = { +if ((blockManagerId ne null) && blockManagerId.isDriver) { --- End diff -- elsewhere in Spark we just do `if (blockManagerId != null && blockManagerId.isDriver)`
[GitHub] spark pull request: [SPARK-4072][Core]Display Streaming blocks in ...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/6672#issuecomment-117799361 @JoshRosen Any more thoughts on this PR?
[GitHub] spark pull request: [SPARK-8660][ML] Convert JavaDoc style comment...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7096#discussion_r33714571 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -211,22 +211,22 @@ class LogisticRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { val trainer = (new LogisticRegression).setFitIntercept(true) val model = trainer.fit(binaryDataset) -/** - * Using the following R code to load the data and train the model using glmnet package. - * - * > library(glmnet) - * > data <- read.csv("path", header=FALSE) - * > label = factor(data$V1) - * > features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) - * > weights = coef(glmnet(features,label, family="binomial", alpha = 0, lambda = 0)) - * > weights - * 5 x 1 sparse Matrix of class "dgCMatrix" - * s0 - * (Intercept) 2.8366423 - * data.V2 -0.5895848 - * data.V3 0.8931147 - * data.V4 -0.3925051 - * data.V5 -0.7996864 +/* + Using the following R code to load the data and train the model using glmnet package. + + > library(glmnet) --- End diff -- @Rosstin Could you also remove `>`s? Then people can easily copy paste.
[GitHub] spark pull request: [SPARK-8660][ML] Convert JavaDoc style comment...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7096#discussion_r33714575 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -211,22 +211,22 @@ class LogisticRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { val trainer = (new LogisticRegression).setFitIntercept(true) val model = trainer.fit(binaryDataset) -/** - * Using the following R code to load the data and train the model using glmnet package. - * - * > library(glmnet) - * > data <- read.csv("path", header=FALSE) - * > label = factor(data$V1) - * > features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) - * > weights = coef(glmnet(features,label, family="binomial", alpha = 0, lambda = 0)) - * > weights - * 5 x 1 sparse Matrix of class "dgCMatrix" - * s0 - * (Intercept) 2.8366423 - * data.V2 -0.5895848 - * data.V3 0.8931147 - * data.V4 -0.3925051 - * data.V5 -0.7996864 +/* + Using the following R code to load the data and train the model using glmnet package. + + > library(glmnet) + > data <- read.csv("path", header=FALSE) + > label = factor(data$V1) + > features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) + > weights = coef(glmnet(features,label, family="binomial", alpha = 0, lambda = 0)) + > weights --- End diff -- If we remove `>`, please insert an empty line to separate commands and results.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714625 --- Diff: network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java --- @@ -138,6 +139,26 @@ public void registerWithShuffleServer( client.sendRpcSync(registerMessage, 5000 /* timeoutMs */); } + /** + * Removes this application from the external shuffle server and optionally deletes local + * files. + * + * @param host Host of the shuffle server. + * @param port Port of the shuffle server. + * @param cleanupLocalDirs True if corresponding shuffle files should be deleted + * @throws IOException + */ + public void applicationRemoved(String host, + int port, --- End diff -- This is really strange. When I have a callback called `applicationRemoved` I would expect to see an appID, not a host and a port. All of that should be transparent to the caller and handled internally right? The other thing is that there's already similar logic in `ExternalShuffleBlockResolver#applicationRemoved`.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117800161 Merged build triggered.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user dragos commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33719412 --- Diff: network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java --- @@ -138,6 +139,26 @@ public void registerWithShuffleServer( client.sendRpcSync(registerMessage, 5000 /* timeoutMs */); } + /** + * Removes this application from the external shuffle server and optionally deletes local + * files. + * + * @param host Host of the shuffle server. + * @param port Port of the shuffle server. + * @param cleanupLocalDirs True if corresponding shuffle files should be deleted + * @throws IOException + */ + public void applicationRemoved(String host, + int port, --- End diff -- Note that other (all?) methods in this class take a host and a port. This is not a callback. This is a client wrapper for connecting to the ExternalShuffleClient. They can be on any executor. How would you tell the client where to connect to?
[GitHub] spark pull request: [SPARK-5016] Distribute GMM mixture components...
GitHub user feynmanliang opened a pull request: https://github.com/apache/spark/pull/7166 [SPARK-5016] Distribute GMM mixture components to executors Distribute expensive portions of computation for Gaussian mixture components (in particular, pre-computation of `MultivariateGaussian.rootSigmaInv`, the inverse covariance matrix and covariance determinant) across executors. Repost of PR#4654. You can merge this pull request into a Git repository by running: $ git pull https://github.com/feynmanliang/spark GMM_parallel_mixtures Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7166.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7166 commit 1da3c7f609f55d2be3c95d9265d256f6d2dc669f Author: Feynman Liang fli...@databricks.com Date: 2015-06-30T22:58:58Z Distribute mixtures
[GitHub] spark pull request: [SPARK-8647][MLlib] Potential issue with const...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7146#issuecomment-117798423 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117798935 It wasn't clear to me how to use DFs in the result classes I was creating since they didn't have access to the model parameters (featuresCol, predictionCol, etc). I could add them as constructor params if you think that'll be better, but it's not clear to me what the benefit of using a DF in the result classes is since most use cases will only be interested in the summary functions rather than the predictions + labels themselves.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714016 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -111,12 +113,19 @@ private[spark] class BlockManager( // Client to read other executors' shuffle files. This is either an external service, or just the // standard BlockTransferService to directly connect to other Executors. - private[spark] val shuffleClient = if (externalShuffleServiceEnabled) { -val transConf = SparkTransportConf.fromSparkConf(conf, numUsableCores) -new ExternalShuffleClient(transConf, securityManager, securityManager.isAuthenticationEnabled(), - securityManager.isSaslEncryptionEnabled()) - } else { -blockTransferService + private[spark] val shuffleClient = mockShuffleClient.getOrElse(createShuffleClient) + + private def createShuffleClient: ShuffleClient = { --- End diff -- This doesn't need to be a method. It's probably fine to just do ``` private[spark] val shuffleClient = mockShuffleClient.getOrElse { if (externalShuffleServiceEnabled) ... } ```
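The reviewer's suggestion is the standard `Option.getOrElse { ... }` idiom: build the fallback inline instead of in a named method. A toy stand-in (the `Client` type and `createClient` helper are hypothetical, not Spark's actual `ShuffleClient`):

```scala
// Fall back to constructing the real client only when no mock was injected.
// Client and createClient are hypothetical stand-ins for illustration.
case class Client(kind: String)

def createClient(mock: Option[Client], externalShuffleServiceEnabled: Boolean): Client =
  mock.getOrElse {
    if (externalShuffleServiceEnabled) Client("external") else Client("blockTransfer")
  }

println(createClient(None, true).kind) // external
println(createClient(Some(Client("mock")), true).kind) // mock
```

Because `getOrElse` takes its default by name, the fallback block only runs when the `Option` is empty, so no real client is constructed when a mock is supplied.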
[GitHub] spark pull request: [ML][Minor] update transformSchema methods of ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/6433#issuecomment-117799019 @RoyGao Could you please create a JIRA and add it to this PR's title?
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714710 --- Diff: core/src/test/scala/org/apache/spark/scheduler/mesos/CoarseMesosSchedulerBackendSuite.scala --- @@ -0,0 +1,187 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler.mesos + +import java.util +import java.util.Collections + +import scala.collection.mutable + +import akka.actor.ActorSystem + +import com.typesafe.config.Config + +import org.apache.mesos.Protos.Value.Scalar +import org.apache.mesos.Protos._ +import org.apache.mesos.SchedulerDriver +import org.apache.mesos.MesosSchedulerDriver +import org.apache.spark.scheduler.TaskSchedulerImpl +import org.apache.spark.scheduler.cluster.mesos.{ CoarseMesosSchedulerBackend, MemoryUtils } +import org.apache.spark.{ LocalSparkContext, SparkConf, SparkEnv, SparkContext } + +import org.mockito.Matchers._ +import org.mockito.Mockito._ +import org.mockito.{ ArgumentCaptor, Matchers } + +import org.scalatest.FunSuite +import org.scalatest.mock.MockitoSugar +import org.scalatest.BeforeAndAfter + +class CoarseMesosSchedulerBackendSuite extends FunSuite +with LocalSparkContext +with MockitoSugar +with BeforeAndAfter { + + private def createOffer(offerId: String, slaveId: String, mem: Int, cpu: Int): Offer = { +val builder = Offer.newBuilder() +builder.addResourcesBuilder() + .setName("mem") + .setType(Value.Type.SCALAR) + .setScalar(Scalar.newBuilder().setValue(mem)) +builder.addResourcesBuilder() + .setName("cpus") + .setType(Value.Type.SCALAR) + .setScalar(Scalar.newBuilder().setValue(cpu)) +builder.setId(OfferID.newBuilder() + .setValue(offerId).build()) + .setFrameworkId(FrameworkID.newBuilder() +.setValue("f1")) + .setSlaveId(SlaveID.newBuilder().setValue(slaveId)) + .setHostname(s"host${slaveId}") + .build() + } + + private def createSchedulerBackend(taskScheduler: TaskSchedulerImpl, +driver: SchedulerDriver): CoarseMesosSchedulerBackend = { --- End diff -- style: ``` private def createSchedulerBackend( taskScheduler: TaskSchedulerImpl, driver: SchedulerDriver): CoarseMesosSchedulerBackend = { ... } ```
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117800154 Merged build triggered.
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117800176 Merged build started.
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117800175 Merged build started.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33714772 --- Diff: core/src/test/scala/org/apache/spark/scheduler/mesos/CoarseMesosSchedulerBackendSuite.scala --- @@ -0,0 +1,187 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler.mesos + +import java.util +import java.util.Collections + +import scala.collection.mutable + +import akka.actor.ActorSystem + +import com.typesafe.config.Config + +import org.apache.mesos.Protos.Value.Scalar +import org.apache.mesos.Protos._ +import org.apache.mesos.SchedulerDriver +import org.apache.mesos.MesosSchedulerDriver +import org.apache.spark.scheduler.TaskSchedulerImpl +import org.apache.spark.scheduler.cluster.mesos.{ CoarseMesosSchedulerBackend, MemoryUtils } +import org.apache.spark.{ LocalSparkContext, SparkConf, SparkEnv, SparkContext } + +import org.mockito.Matchers._ +import org.mockito.Mockito._ +import org.mockito.{ ArgumentCaptor, Matchers } --- End diff -- style: no space before/after `{}`. Also, all of these third-party imports should just be grouped together. See other files.
[GitHub] spark pull request: [SPARK-746][CORE][WIP] Added Avro Serializatio...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7004#issuecomment-117805393 Merged build triggered.
[GitHub] spark pull request: [SPARK-746][CORE][WIP] Added Avro Serializatio...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7004#issuecomment-117805408 Merged build started.
[GitHub] spark pull request: [SPARK-746][CORE][WIP] Added Avro Serializatio...
Github user JDrit commented on a diff in the pull request: https://github.com/apache/spark/pull/7004#discussion_r33716906 --- Diff: core/pom.xml --- @@ -398,6 +398,40 @@ <artifactId>py4j</artifactId> <version>0.8.2.1</version> </dependency> + <dependency> + <groupId>org.apache.avro</groupId> + <artifactId>avro</artifactId> + <version>${avro.version}</version> + <scope>${hadoop.deps.scope}</scope> + </dependency> + <dependency> + <groupId>org.apache.avro</groupId> + <artifactId>avro-mapred</artifactId> + <version>${avro.version}</version> --- End diff -- Thanks for pointing that out, just fixed that.
[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/6994#discussion_r33717906 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala --- @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.stat.test + +import org.apache.commons.math3.distribution.{NormalDistribution, RealDistribution} +import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest + +import org.apache.spark.rdd.RDD + +/** + * Conduct the two-sided Kolmogorov-Smirnov test for data sampled from a + * continuous distribution. By comparing the largest difference between the empirical cumulative + * distribution of the sample data and the theoretical distribution we can provide a test for the + * null hypothesis that the sample data comes from that theoretical distribution. + * For more information on the KS test: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test + * + * Implementation note: We seek to implement the KS test with a minimal number of distributed + * passes. We sort the RDD, and then perform the following operations on a per-partition basis: + * calculate an empirical cumulative distribution value for each observation, and a theoretical + * cumulative distribution value. We know the latter to be correct, while the former will be off by + * a constant (how large the constant is depends on how many values precede it in other partitions). + * However, given that this constant simply shifts the ECDF upwards, but doesn't change its shape, + * and furthermore, that constant is the same within a given partition, we can pick 2 values + * in each partition that can potentially resolve to the largest global distance. Namely, we + * pick the minimum distance and the maximum distance. Additionally, we keep track of how many + * elements are in each partition. Once these three values have been returned for every partition, + * we can collect and operate locally. Locally, we can now adjust each distance by the appropriate + * constant (the cumulative sum of # of elements in the prior partitions divided by the data set + * size). Finally, we take the maximum absolute value, and this is the statistic. + */ +private[stat] object KSTest { + + // Null hypothesis for the type of KS test to be included in the result. + object NullHypothesis extends Enumeration { +type NullHypothesis = Value +val oneSampleTwoSided = Value("Sample follows theoretical distribution.") + } + + /** + * Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution. + * @param data `RDD[Double]` data on which to run test + * @param cdf `Double => Double` function to calculate the theoretical CDF + * @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis) + */ + def testOneSample(data: RDD[Double], cdf: Double => Double): KSTestResult = { +val n = data.count().toDouble +val localData = data.sortBy(x => x).mapPartitions { part => + val partDiffs = oneSampleDifferences(part, n, cdf) // local distances + searchOneSampleCandidates(partDiffs) // candidates: local extrema +}.collect() +val ksStat = searchOneSampleStatistic(localData, n) // result: global extreme +evalOneSampleP(ksStat, n.toLong) + } + + /** + * Runs a KS test for 1 set of sample data, comparing it to a theoretical distribution. + * @param data `RDD[Double]` data on which to run test + * @param createDist `() => RealDistribution` function to create a theoretical distribution + * @return KSTestResult summarizing the test results (pval, statistic, and null hypothesis) + */ + def testOneSample(data: RDD[Double], createDist: () => RealDistribution): KSTestResult = { +val n = data.count().toDouble +val localData = data.sortBy(x => x).mapPartitions { part => + val
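The per-partition scheme described in the implementation note can be checked locally (a hedged sketch: plain collections stand in for RDD partitions, and `ksStat`/`ksStatPartitioned` are hypothetical names, not the PR's functions). Each partition reports its element count plus its minimum and maximum unadjusted ECDF-CDF differences; the driver then shifts each pair by the cumulative count of prior partitions over n before taking the global maximum absolute value:

```scala
// Global one-pass KS statistic for sorted data (the reference computation).
def ksStat(sorted: Vector[Double], cdf: Double => Double): Double = {
  val n = sorted.length.toDouble
  sorted.zipWithIndex.map { case (x, i) =>
    math.max((i + 1) / n - cdf(x), cdf(x) - i / n)
  }.max
}

// Partitioned version mimicking the distributed strategy: per partition keep
// (min diff, max diff, count); the local ECDF is off by a constant equal to
// (# elements in prior partitions) / n, applied afterwards on the "driver".
def ksStatPartitioned(parts: Seq[Vector[Double]], cdf: Double => Double): Double = {
  val n = parts.map(_.length).sum.toDouble
  val local = parts.map { part =>
    val diffs = part.zipWithIndex.flatMap { case (x, j) =>
      Seq((j + 1) / n - cdf(x), j / n - cdf(x)) // ECDF just after / before x
    }
    (diffs.min, diffs.max, part.length)
  }
  var prior = 0
  val candidates = local.flatMap { case (mn, mx, cnt) =>
    val shift = prior / n
    prior += cnt
    Seq(math.abs(mn + shift), math.abs(mx + shift))
  }
  candidates.max
}

val data = Vector(0.1, 0.2, 0.35, 0.5, 0.8, 0.9) // already sorted
val parts = Seq(data.take(3), data.drop(3))
println(ksStat(data, x => x))             // uniform CDF on [0, 1]
println(ksStatPartitioned(parts, x => x)) // same value
```

The shift argument works because adding the constant to a partition's minimum and maximum diffs preserves which elements can attain the global extreme, so only two candidates per partition ever need to leave the executors.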
[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7099#issuecomment-117813319 It wasn't clear to me how to use DFs in the result classes I was creating since they didn't have access to the model parameters (featuresCol, predictionCol, etc). I could add them as constructor params if you think that'll be better but it's not clear to me what the benefit of using a DF in the result classes is since most use cases will only be interested in the summary functions rather than the predictions + labels themselves. Sorry, this was ambiguous. The plan is to have DFs for each result type, not necessarily ones zipped with the transformed data. Later on, we could provide extra output columns to include the values in the transformed data, but we won't just yet. E.g., we can provide a DataFrame storing only 1 column of residuals.
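The "single column of residuals" idea can be sketched locally (hedged: plain Scala collections standing in for a one-column DataFrame; `LabeledPrediction` and `residuals` are hypothetical names):

```scala
// One "column" of residuals derived from (label, prediction) pairs, without
// zipping it back onto the transformed data. In Spark ML this would be a
// single-column DataFrame; here a Seq stands in for it.
case class LabeledPrediction(label: Double, prediction: Double)

def residuals(rows: Seq[LabeledPrediction]): Seq[Double] =
  rows.map(r => r.label - r.prediction)

val rows = Seq(LabeledPrediction(1.0, 0.75), LabeledPrediction(2.0, 2.5))
println(residuals(rows)) // List(0.25, -0.5)
```

The point of the design is that the result type owns its own derived column, so it never needs the model's `featuresCol`/`predictionCol` parameters.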
[GitHub] spark pull request: [SPARK-8479] [MLlib] Add numNonzeros and numAc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6904#issuecomment-117815373 [Test build #36288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36288/console) for PR 6904 at commit [`252c6b7`](https://github.com/apache/spark/commit/252c6b72426300fa0859c9d83c0b014b5f94bf6e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117816111 [Test build #36299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36299/consoleFull) for PR 7165 at commit [`867754a`](https://github.com/apache/spark/commit/867754aa2b852e971c6d4359d67a469b1f709611).
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
GitHub user Rosstin opened a pull request: https://github.com/apache/spark/pull/7167 [SPARK-8660] [MLLib] removed symbols from comments in LogisticRegressionSuite.scala for ease of copypaste '' symbols removed from comments in LogisticRegressionSuite.scala, for ease of copypaste also single-lined the multiline commands (is this desirable, or does it violate style?) You can merge this pull request into a Git repository by running: $ git pull https://github.com/Rosstin/spark SPARK-8660-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7167.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7167 commit 6c18058336a1a027207d194c31134351ec3ed86d Author: Rosstin astera...@gmail.com Date: 2015-06-26T18:00:35Z fixed minor typos in docs/README.md and docs/api.md commit 21ac1e54283d633e5c4978427e03937b17c1b626 Author: Rosstin astera...@gmail.com Date: 2015-06-29T17:06:15Z Merge branch 'master' of github.com:apache/spark into SPARK-8639 commit 2cd298520f7018fbd2fc5174f40a984db52b68a0 Author: Rosstin astera...@gmail.com Date: 2015-06-29T20:06:15Z Merge branch 'master' of github.com:apache/spark into SPARK-8639 commit 242aeddcd949f23c86c5b3000e27019a901df64b Author: Rosstin astera...@gmail.com Date: 2015-06-29T20:18:39Z SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala commit bb9a4b19487c55c826d0a49b1987dba9b4d7d031 Author: Rosstin astera...@gmail.com Date: 2015-06-29T20:21:17Z Merge branch 'master' of github.com:apache/spark into SPARK-8660 commit 5a05dee9fb142e8997b85904d01a02964ed32553 Author: Rosstin astera...@gmail.com Date: 2015-06-29T20:27:44Z SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code. 
commit 39ddd50ee27d80debf02cf9a5985c8bf2f4cb94c Author: Rosstin astera...@gmail.com Date: 2015-07-01T20:13:37Z Merge branch 'master' of github.com:apache/spark into SPARK-8661 commit fe6b11224126adb5692292fbb8b8c1bc03c48f46 Author: Rosstin astera...@gmail.com Date: 2015-07-01T20:40:41Z SPARK-8660 symbols removed from LogisticRegressionSuite.scala for easy of copypaste
[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33722239 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import scala.collection.mutable + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +/** + * :: Experimental :: + * Converts a text document to a sparse vector of token counts. + * @param vocabulary An Array over terms. Only the terms in the vocabulary will be counted. + */ +@Experimental +class CountVectorizer (override val uid: String, vocabulary: Array[String]) extends HashingTF { + + def this(vocabulary: Array[String]) = this(Identifiable.randomUID("countVectorizer"), vocabulary) --- End diff -- This is probably fine for now, but I had some thoughts about having an empty constructor for including every word encountered if no vocabulary is provided. If it requires significant modification, we should make a separate JIRA for it.
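The reviewer's idea — derive a vocabulary from the corpus when none is supplied, then count against it — can be sketched in plain Scala (hedged: `fitVocabulary` and `countVector` are hypothetical helpers, not the CountVectorizer API):

```scala
// Hypothetical sketch of the review comment: with no vocabulary given, build
// one from every token encountered, then produce a count vector per document.
def fitVocabulary(docs: Seq[Seq[String]]): Array[String] =
  docs.flatten.distinct.sorted.toArray

def countVector(doc: Seq[String], vocab: Array[String]): Array[Int] = {
  val index = vocab.zipWithIndex.toMap
  val counts = Array.fill(vocab.length)(0)
  doc.foreach(t => index.get(t).foreach(i => counts(i) += 1)) // unknown tokens ignored
  counts
}

val docs = Seq(Seq("a", "b"), Seq("b", "c", "b"))
val vocab = fitVocabulary(docs)
println(vocab.toList)                      // List(a, b, c)
println(countVector(docs(1), vocab).toList) // List(0, 2, 1)
```

Fitting the vocabulary is an extra pass over the corpus, which is presumably why it would warrant a separate JIRA rather than a tweak to the constructor.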
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
Github user Rosstin commented on the pull request: https://github.com/apache/spark/pull/7167#issuecomment-117820353 @mengxr Would it be desirable to un-multiline the LOC in the file's comments? Or should these remain multiline to follow style? (What I mean is, the lines are long enough that they were being broken into multiple lines, so copy-pasting them would be harder. I made them back into single-line.)
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33723836 --- Diff: core/src/test/scala/org/apache/spark/scheduler/ExecutorBlacklistTrackerSuite.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable + +import org.scalatest.{BeforeAndAfter, PrivateMethodTester} + +import org.apache.spark._ +import org.apache.spark.scheduler.ExecutorBlacklistTracker.ExecutorFailureStatus +import org.apache.spark.scheduler.cluster.ExecutorInfo +import org.apache.spark.util.ManualClock + +class ExecutorBlacklistTrackerSuite + extends SparkFunSuite + with LocalSparkContext + with BeforeAndAfter { + import ExecutorBlacklistTrackerSuite._ + + before { +if (sc == null) { + sc = createSparkContext +} + } + + after { +if (sc != null) { + sc.stop() + sc = null +} + } + + test("add executor to blacklist") { +// Add 5 executors +addExecutors(5) +val tracker = sc.executorBlacklistTracker.get +assert(numExecutorsRegistered(tracker) === 5) + +// Post 5 TaskEnd events to executor-1 to add executor-1 to the blacklist +(0 until 5).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-1")) + +assert(tracker.getExecutorBlacklist === Set("executor-1")) +assert(executorIdToTaskFailures(tracker)("executor-1").numFailures === 5) +assert(executorIdToTaskFailures(tracker)("executor-1").isBlackListed === true) + +// Post 10 TaskEnd events to executor-2 to add executor-2 to the blacklist +(0 until 10).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-2")) +assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2")) +assert(executorIdToTaskFailures(tracker)("executor-2").numFailures === 10) +assert(executorIdToTaskFailures(tracker)("executor-2").isBlackListed === true) + +// Post 5 TaskEnd events to executor-3 to verify whether executor-3 is blacklisted +(0 until 5).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-3")) +// Since the failure number of executor-3 is less than the average blacklist threshold, +// though it exceeds the fault threshold, it should not be added to the blacklist +assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2")) +assert(executorIdToTaskFailures(tracker)("executor-3").numFailures === 5) +assert(executorIdToTaskFailures(tracker)("executor-3").isBlackListed === false) + +// Keep posting TaskEnd events to executor-3 to add executor-3 to the blacklist +(0 until 2).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-3")) +assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2", "executor-3")) +assert(executorIdToTaskFailures(tracker)("executor-3").numFailures === 7) +assert(executorIdToTaskFailures(tracker)("executor-3").isBlackListed === true) + +// Post TaskEnd events to executor-4 to verify whether executor-4 can be added to the blacklist +(0 until 10).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-4")) +// Even though executor-4's failed task count is above the blacklist threshold, +// the number of blacklisted executors has reached the maximum fraction, +// so executor-4 still cannot be added to the blacklist. --- End diff -- just thinking aloud -- I wonder if this is really the best we can do. If there are too many executors blacklisted, does it make more sense to just kill the job or the spark context? I suppose that if you really keep having a lot of failures past this point, eventually you'll trigger the 4 task failures on one node which will kill the job, so maybe this is reasonable?
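The capping behavior being questioned can be sketched with a toy tracker (hedged: `ToyBlacklistTracker` is hypothetical and far simpler than the PR's `ExecutorBlacklistTracker`). An executor is blacklisted only if its failure count reaches a fault threshold, exceeds the fleet average, and blacklisting it would not push the blacklisted fraction past a maximum:

```scala
import scala.collection.mutable

// Hypothetical toy tracker illustrating the three conditions under discussion:
// a per-executor fault threshold, above-average failures, and a cap on the
// fraction of executors that may be blacklisted at once.
class ToyBlacklistTracker(faultThreshold: Int, maxBlacklistFraction: Double) {
  private case class Status(var numFailures: Int = 0, var isBlackListed: Boolean = false)
  private val executors = mutable.LinkedHashMap.empty[String, Status]

  def registerExecutor(id: String): Unit = {
    executors.getOrElseUpdate(id, Status())
  }

  def onTaskFailure(id: String): Unit = {
    val s = executors.getOrElseUpdate(id, Status())
    s.numFailures += 1
    val avg = executors.values.map(_.numFailures).sum.toDouble / executors.size
    val blacklisted = executors.values.count(_.isBlackListed)
    if (!s.isBlackListed &&
        s.numFailures >= faultThreshold &&
        s.numFailures > avg &&
        (blacklisted + 1).toDouble / executors.size <= maxBlacklistFraction) {
      s.isBlackListed = true
    }
  }

  def getExecutorBlacklist: Set[String] =
    executors.collect { case (id, s) if s.isBlackListed => id }.toSet
}

val tracker = new ToyBlacklistTracker(faultThreshold = 5, maxBlacklistFraction = 0.5)
(1 to 5).foreach(i => tracker.registerExecutor(s"executor-$i"))
(1 to 5).foreach(_ => tracker.onTaskFailure("executor-1")) // blacklisted: 5 >= 5, above avg
(1 to 5).foreach(_ => tracker.onTaskFailure("executor-2")) // blacklisted: fraction 2/5 <= 0.5
(1 to 7).foreach(_ => tracker.onTaskFailure("executor-3")) // capped out: 3/5 > 0.5, stays off
println(tracker.getExecutorBlacklist) // executor-1 and executor-2 only
```

The cap embodies the trade-off squito raises: past the fraction limit the tracker stops blacklisting and lets the normal per-task failure limits kill the job instead.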
[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt
Github user squito commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-117826371 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/7167#issuecomment-117826411 For copy-pasting in Scala, :paste mode makes multi-line copy/paste work well (although it requires remembering that, plus ctrl-D)
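A minimal illustration of the :paste flow holdenk describes (transcript abbreviated; the exact banner text varies by Scala version):

```
scala> :paste
// Entering paste mode (ctrl-D to finish)

val xs = List(1, 2, 3)
xs.map(_ * 2)

// press ctrl-D here
// Exiting paste mode, now interpreting.
```

Without :paste, the REPL interprets each pasted line eagerly, which breaks multi-line definitions such as a class body split across lines.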
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7093#discussion_r33725936 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala --- @@ -253,7 +253,9 @@ class PairUDF extends GenericUDF { ) override def evaluate(args: Array[DeferredObject]): AnyRef = { +// scalastyle:off println println("Type = %s".format(args(0).getClass.getName)) --- End diff -- This is a big change. I still think a number of these printlns are debug leftovers and can be removed, or in some cases turned into logging. I don't know of a better way to review these than to just sift through them a few times. I think anything in a main() method or in close support of a CLI utility can stay; examples too are probably OK with println. Obviously there are some methods whose job it is to print to the console directly. Everything else, I'm not sure where you would generally allow println. So, this is one I think can be removed?
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7093#discussion_r33725957 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala --- @@ -87,7 +87,9 @@ class InsertIntoHiveTableSuite extends QueryTest with BeforeAndAfter { sql("CREATE TABLE doubleCreateAndInsertTest (key int, value string)") }.getMessage +// scalastyle:off println --- End diff -- Remove?
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-117828706 [Test build #36303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36303/consoleFull) for PR 5423 at commit [`e7897ae`](https://github.com/apache/spark/commit/e7897ae1493a7fef4f51e41ca72c4e98d028eda7).
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7093#discussion_r33726438 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -684,7 +684,9 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { val buffer = new CircularBuffer(25) val stream = new java.io.PrintStream(buffer, true, "UTF-8") +// scalastyle:off println --- End diff -- Same, false positive, just wondering if this really fails
[GitHub] spark pull request: [SPARK-5095][MESOS] Support capping cores and ...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/4027#issuecomment-117832235 @tnachen can you address the comments to simplify this patch using `spark.executor.cores` instead? As discussed above this is an optional setting and reuses the same setting across cluster managers. It's also much simpler to reason about than the two proposed configs in this patch.
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117832863 Merged build triggered.
[GitHub] spark pull request: [SPARK-5095][MESOS] Support capping cores and ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4027#discussion_r33728044 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala --- @@ -109,6 +123,13 @@ private[spark] class CoarseMesosSchedulerBackend( } } + protected def driverUrl: String = AkkaUtils.address( + AkkaUtils.protocol(sc.env.actorSystem), + SparkEnv.driverActorSystemName, + conf.get("spark.driver.host"), + conf.get("spark.driver.port"), + CoarseGrainedSchedulerBackend.ACTOR_NAME) --- End diff -- I suggested this in #4984, but this could just do a check to see if we're in a test:

```
private val driverUrl: String = {
  val testing = conf.get("spark.testing", "false").toBoolean
  if (testing) {
    "stub" // Mock this class without connecting to a non-existent driver in tests
  } else {
    AkkaUtils.address(...)
  }
}
```
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117832900 Merged build started.
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117835854 Merged build started.
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117835834 Merged build triggered.
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117837584 [Test build #36306 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36306/console) for PR 7165 at commit [`3b09d31`](https://github.com/apache/spark/commit/3b09d3146d2cf867d50500b727aeace5f7a8ac16). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33729664 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorBlacklistTracker.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.util.concurrent.TimeUnit + +import scala.collection.mutable + +import org.apache.spark._ +import org.apache.spark.util.{Clock, SystemClock, ThreadUtils, Utils} + +/** + * ExecutorBlacklistTracker blacklists executors by tracking the status of running tasks with a + * heuristic algorithm. + * + * An executor will be considered bad enough only when: + * 1. The number of failed tasks on this executor is more than + * spark.scheduler.blacklist.executorFaultThreshold. + * 2. The number of failed tasks on this executor is + * spark.scheduler.blacklist.averageBlacklistThreshold more than the average number of failed + * tasks of this cluster.
+ * + * Also the max number of blacklisted executors will not exceed the + * spark.scheduler.blacklist.maxBlacklistFraction of the whole cluster, and blacklisted executors + * will be forgiven when there are no failed tasks within + * spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes. + */ +private[spark] class ExecutorBlacklistTracker(conf: SparkConf) extends SparkListener { + import ExecutorBlacklistTracker._ + + private val maxBlacklistFraction = conf.getDouble( + "spark.scheduler.blacklist.maxBlacklistFraction", MAX_BLACKLIST_FRACTION) + private val avgBlacklistThreshold = conf.getDouble( + "spark.scheduler.blacklist.averageBlacklistThreshold", AVERAGE_BLACKLIST_THRESHOLD) + private val executorFaultThreshold = conf.getInt( + "spark.scheduler.blacklist.executorFaultThreshold", EXECUTOR_FAULT_THRESHOLD) + private val executorFaultTimeoutWindowInMinutes = conf.getInt( + "spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes", EXECUTOR_FAULT_TIMEOUT_WINDOW) + + // Count the number of executors registered + var numExecutorsRegistered: Int = 0 + + // Track the number of failed tasks and the time of the latest failure per executor id + val executorIdToTaskFailures = new mutable.HashMap[String, ExecutorFailureStatus]() + + // Clock used to update and exclude the executors which are out of the time window.
+ private var clock: Clock = new SystemClock() + + // Executor that handles the scheduling task + private val executor = ThreadUtils.newDaemonSingleThreadScheduledExecutor( + "spark-scheduler-blacklist-expire-timer") + + def start(): Unit = { + val scheduleTask = new Runnable() { + override def run(): Unit = { + Utils.logUncaughtExceptions(expireTimeoutExecutorBlacklist()) + } + } + executor.scheduleAtFixedRate(scheduleTask, 0L, 60, TimeUnit.SECONDS) + } + + def stop(): Unit = { + executor.shutdown() + executor.awaitTermination(10, TimeUnit.SECONDS) + } + + def setClock(newClock: Clock): Unit = { + clock = newClock + } + + def getExecutorBlacklist: Set[String] = synchronized { + executorIdToTaskFailures.filter(_._2.isBlackListed).keys.toSet --- End diff -- this gets called a lot, and (hopefully) is updated only rarely. It should probably be computed only when it changes, and then stored. (Also I think the stored value could probably just be `@volatile`, so you wouldn't need to synchronize ... but I'm not 100% sure ...)
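A minimal sketch of the caching approach squito suggests above (class and method names hypothetical, not the PR's actual code): recompute the blacklisted set only when a failure status mutates, store the snapshot in a `@volatile` field, and let the hot read path do a plain volatile read with no lock.

```scala
import scala.collection.mutable

class BlacklistCache {
  // executorId -> isBlackListed; mutated only under the lock below.
  private val failureStatus = new mutable.HashMap[String, Boolean]()

  // Immutable snapshot; readers see it with a volatile read, no synchronization.
  @volatile private var cachedBlacklist: Set[String] = Set.empty

  // All mutations go through here; recompute the snapshot while holding the lock.
  def update(executorId: String, isBlackListed: Boolean): Unit = synchronized {
    failureStatus(executorId) = isBlackListed
    cachedBlacklist = failureStatus.filter(_._2).keys.toSet
  }

  // Called on every scheduling decision; now just a field read.
  def getExecutorBlacklist: Set[String] = cachedBlacklist
}
```

Because `Set[String]` is immutable, publishing it through a `@volatile` var is safe: a reader either sees the old snapshot or the new one, never a half-updated collection.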
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7167#issuecomment-117820196 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-8746][SQL] update download link for Hiv...
Github user ckadner commented on the pull request: https://github.com/apache/spark/pull/7144#issuecomment-117820180 @JoshRosen -- Hi Josh, thanks for kicking off the tests. Can you help me make sense of the test results? I only changed a markdown file, so no test cases should be impacted, and no interface methods were touched that could throw off the MiMa tests. ``` [error] running /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/mima ; received return code 255 Archiving unit tests logs... No log files found. Attempting to post to Github... Post successful. Archiving artifacts WARN: No artifacts found that match the file pattern **/target/unit-tests.log. Configuration error? WARN: java.lang.InterruptedException: no matches found within 1 Recording test results ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error? Finished: FAILURE ```
[GitHub] spark pull request: [SPARK-8695] [core] [WIP] TreeAggregation shou...
GitHub user piganesh opened a pull request: https://github.com/apache/spark/pull/7168 [SPARK-8695] [core] [WIP] TreeAggregation shouldn't be triggered for 5 partitions You can merge this pull request into a Git repository by running: $ git pull https://github.com/ibmsoe/spark SPARK-8695 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7168.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7168 commit a6fed07f65c3c71e19f231c54fa43fc4904b040d Author: Perinkulam I. Ganesh g...@us.ibm.com Date: 2015-07-01T20:50:28Z [SPARK-8695] [core] [WIP] TreeAggregation shouldn't be triggered for 5 partitions
[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-117823839 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-117823675 **[Test build #36289 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36289/console)** for PR 6648 at commit [`55a9bb1`](https://github.com/apache/spark/commit/55a9bb1d8617424e059e7de052ecaca154e4ab44) after a configured wait of `175m`.
[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-117827909 [Test build #36302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36302/consoleFull) for PR 6648 at commit [`55a9bb1`](https://github.com/apache/spark/commit/55a9bb1d8617424e059e7de052ecaca154e4ab44).
[GitHub] spark pull request: [SPARK-1564] [Docs] Added Javascript to Javado...
GitHub user deroneriksson opened a pull request: https://github.com/apache/spark/pull/7169 [SPARK-1564] [Docs] Added Javascript to Javadocs to create badges for tags like :: Experimental :: Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags You can merge this pull request into a Git repository by running: $ git pull https://github.com/deroneriksson/spark SPARK-1564_JavaDocs_badges Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7169.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7169 commit 65b493072eb385ef25151a6485369f801730c760 Author: Deron Eriksson de...@us.ibm.com Date: 2015-07-01T21:08:43Z Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33726149 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -124,10 +124,16 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkCon (blockId, getFile(blockId)) } + /** + * Create local directories for storing block data. These directories are + * located inside configured local directories and won't + * be deleted on JVM exit when using the external shuffle service. --- End diff -- I just read your comment again. I still don't see how the directory layout is related to cleaning up shuffle files. The reason why we don't clean up shuffle files in Mesos (and standalone mode) is simply because the shuffle service doesn't know when an application exits. When shuffle service is enabled, [executors no longer clean up the shuffle files on exit](https://github.com/apache/spark/blob/1ce6428907b4ddcf52dbf0c86196d82ab7392442/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L162), so no one cleans these files up anymore. All we need to do then is to add this missing code path. Since the external shuffle service already [knows](https://github.com/apache/spark/blob/1ce6428907b4ddcf52dbf0c86196d82ab7392442/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L147) about the `localDirs` on each executor, it can just go ahead and delete these directories (which contain the shuffle files written). Could you explain why the directory structure needs to change? Why is it not sufficient to just remove the shuffle directories?
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/7093#issuecomment-117829855 Heroic effort here. It certainly is big but is fixing up a lot of little issues of this form. I flagged a few more questions about it but it is looking pretty good. Is anyone else concerned about the cost of future merge conflicts on this one vs the benefit?
[GitHub] spark pull request: [SPARK-1564] [Docs] Added Javascript to Javado...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7169#issuecomment-117831150 Merged build started.
[GitHub] spark pull request: [SPARK-1564] [Docs] Added Javascript to Javado...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7169#issuecomment-117831126 Merged build triggered.
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117832007 [Test build #36294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36294/console) for PR 7139 at commit [`0226768`](https://github.com/apache/spark/commit/0226768c1cddce134253b9d817098880f04de0a5). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117832118 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6910] [WiP] Reduce number of operations...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7049#discussion_r33727646 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -379,31 +379,25 @@ abstract class HadoopFsRelation private[sql](maybePartitionSpec: Option[Partitio var leafDirToChildrenFiles = mutable.Map.empty[Path, Array[FileStatus]] def refresh(): Unit = { - // We don't filter files/directories whose names start with "_" except "_temporary" here, as - // specific data sources may take advantage of them (e.g. Parquet _metadata and - // _common_metadata files). "_temporary" directories are explicitly ignored since failed - // tasks/jobs may leave partial/corrupted data files there. - def listLeafFilesAndDirs(fs: FileSystem, status: FileStatus): Set[FileStatus] = { - if (status.getPath.getName.toLowerCase == "_temporary") { - Set.empty - } else { - val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDir) - val leafDirs = if (dirs.isEmpty) Set(status) else Set.empty[FileStatus] - files.toSet ++ leafDirs ++ dirs.flatMap(dir => listLeafFilesAndDirs(fs, dir)) - } - } leafFiles.clear() - val statuses = paths.flatMap { path => + val statuses = paths.par.flatMap { path => val hdfsPath = new Path(path) val fs = hdfsPath.getFileSystem(hadoopConf) val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory) - Try(fs.getFileStatus(qualified)).toOption.toArray.flatMap(listLeafFilesAndDirs(fs, _)) + val it = fs.listFiles(qualified, true) --- End diff -- Unfortunately this API doesn't exist in Hadoop 1...
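Since `FileSystem.listFiles(path, recursive)` only appeared in Hadoop 2, a version-neutral alternative has to recurse manually over `listStatus`, which exists in both Hadoop 1 and 2. A hedged sketch (the helper name is hypothetical; `FileStatus.isDir` is the Hadoop-1-era accessor, later deprecated in favor of `isDirectory`):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Recursively collect leaf (non-directory) files under `root` using only
// calls available in both Hadoop 1 and Hadoop 2.
def listLeafFiles(fs: FileSystem, root: Path): Seq[FileStatus] = {
  val (dirs, files) = fs.listStatus(root).partition(_.isDir)
  files.toSeq ++ dirs.flatMap(d => listLeafFiles(fs, d.getPath))
}
```

Note this issues one `listStatus` RPC per directory, whereas the Hadoop 2 `listFiles(path, true)` iterator can batch locatedstatus responses; that performance gap is part of why the PR wanted the newer API.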
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117833346 [Test build #36305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36305/consoleFull) for PR 7131 at commit [`7104e97`](https://github.com/apache/spark/commit/7104e97f01f47c6405bc8f8e51b5485ddca27efe).
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117834553 [Test build #36306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36306/consoleFull) for PR 7165 at commit [`3b09d31`](https://github.com/apache/spark/commit/3b09d3146d2cf867d50500b727aeace5f7a8ac16).
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33729199 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorBlacklistTracker.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.util.concurrent.TimeUnit + +import scala.collection.mutable + +import org.apache.spark._ +import org.apache.spark.util.{Clock, SystemClock, ThreadUtils, Utils} + +/** + * ExecutorBlacklistTracker blacklists the executors by tracking the status of running tasks with + * heuristic algorithm. + * + * A executor will be considered bad enough only when: + * 1. The failure task number on this executor is more than + * spark.scheduler.blacklist.executorFaultThreshold. + * 2. The failure task number on this executor is + * spark.scheduler.blacklist.averageBlacklistThreshold more than average failure task number + * of this cluster. 
 + * + * Also max number of blacklisted executors will not exceed the + * spark.scheduler.blacklist.maxBlacklistFraction of whole cluster, and blacklisted executors + * will be forgiven when there is no failure tasks in the + * spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes. + */ +private[spark] class ExecutorBlacklistTracker(conf: SparkConf) extends SparkListener { + import ExecutorBlacklistTracker._ + + private val maxBlacklistFraction = conf.getDouble( + "spark.scheduler.blacklist.maxBlacklistFraction", MAX_BLACKLIST_FRACTION) + private val avgBlacklistThreshold = conf.getDouble( + "spark.scheduler.blacklist.averageBlacklistThreshold", AVERAGE_BLACKLIST_THRESHOLD) + private val executorFaultThreshold = conf.getInt( + "spark.scheduler.blacklist.executorFaultThreshold", EXECUTOR_FAULT_THRESHOLD) + private val executorFaultTimeoutWindowInMinutes = conf.getInt( + "spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes", EXECUTOR_FAULT_TIMEOUT_WINDOW) --- End diff -- these new confs need to be documented (along with `spark.scheduler.blacklist.enabled`)
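The blacklisting heuristic described in the quoted doc comment can be condensed into a small executable sketch. The concrete threshold values and the exact way the relative margin combines with the absolute threshold are illustrative assumptions on my part, not the PR's actual defaults:

```scala
// Sketch of the two-condition blacklist heuristic plus the cluster-wide cap.
// All values below are illustrative assumptions.
val executorFaultThreshold = 5    // absolute failure-count threshold
val avgBlacklistThreshold = 0.5   // relative margin over the cluster average
val maxBlacklistFraction = 0.5    // cap on the fraction of blacklisted executors

// Observed task failures per executor.
val failures = Map("exec-1" -> 9, "exec-2" -> 6, "exec-3" -> 1, "exec-4" -> 0)
val numExecutors = failures.size
val avgFailures = failures.values.sum.toDouble / numExecutors

// An executor is "bad enough" only when BOTH conditions hold:
// 1. its failure count exceeds the absolute threshold, and
// 2. its failure count exceeds the cluster average by the relative margin.
def isBad(numFailures: Int): Boolean =
  numFailures > executorFaultThreshold &&
    numFailures > avgFailures * (1 + avgBlacklistThreshold)

// The blacklist never exceeds maxBlacklistFraction of the cluster;
// the worst offenders are taken first.
val maxBlacklisted = (numExecutors * maxBlacklistFraction).toInt
val blacklisted = failures.filter { case (_, n) => isBad(n) }
  .toSeq.sortBy(-_._2).take(maxBlacklisted).map(_._1).toSet
// blacklisted == Set("exec-1"): exec-2's 6 failures clear the absolute
// threshold but not the margin over the average of 4.0.
```

Whether `averageBlacklistThreshold` is a relative or an absolute margin is not fully determined by the quoted excerpt; the relative reading above is one plausible interpretation.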
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117835313 @JoshRosen @shaneknapp -- So in this case the SparkR unit tests failed, but the AmplabJenkins message says "Test Passed"? Do you know what could cause this?
[GitHub] spark pull request: [SPARK-8660] [MLLib] removed symbols from co...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7167#issuecomment-117837038 Let's keep the line width within 100. As @holdenk mentioned, we can copy-paste a paragraph of code into Scala and IPython easily. I also tried RStudio, which accepts multiline statements as well.
[GitHub] spark pull request: [SPARK-8450] [SQL] [PYSARK] cleanup type conve...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7131#issuecomment-117836993 [Test build #36307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36307/consoleFull) for PR 7131 at commit [`7d73168`](https://github.com/apache/spark/commit/7d73168de9b45cb15229f9fe2e7e97304aa8c375).
[GitHub] spark pull request: [SPARK-3071] Increase default driver memory
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7132#issuecomment-117818124 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-3071] Increase default driver memory
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7132#issuecomment-117818081 [Test build #36291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36291/console) for PR 7132 at commit [`fd67721`](https://github.com/apache/spark/commit/fd67721c3d63b5aaca092c72c424bdeb726440c8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user jonalter commented on the pull request: https://github.com/apache/spark/pull/7093#issuecomment-117820014 @srowen - I have addressed the point you made regarding using log rather than println in some cases. Please let me know what you think. Thank you.
[GitHub] spark pull request: [SPARK-8746][SQL] update download link for Hiv...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/7144#issuecomment-117821666 This is a rare build flakiness issue that we haven't diagnosed. I kicked off tests just to test out a pull request builder script change. This change looks fine to me, so I think we should merge this into master and 1.4.
[GitHub] spark pull request: [SPARK-8695] [core] [WIP] TreeAggregation shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7168#issuecomment-117823182 Can one of the admins verify this patch?
[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7139#issuecomment-117825520 @brkyvz Thanks for sending out this PR. It's looking good. I had a couple of high-level points: 1. It might be good to point out in some of the error messages how the JAR should be structured for this to work. I know that it works out of the box with the SBT plugin, but it would be good to explain this for users who aren't using the plugin. 2. Does this also do the package install on all the executors? It's not important right now with the DataFrame API, but some of the work we do in the future will run R code on the executors.
[GitHub] spark pull request: [SPARK-5016][MLLib] Distribute GMM mixture com...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7166#issuecomment-117825624 [Test build #36301 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36301/console) for PR 7166 at commit [`1da3c7f`](https://github.com/apache/spark/commit/1da3c7f609f55d2be3c95d9265d256f6d2dc669f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7977] [Build] Disallowing println
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7093#discussion_r33726211 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/NumericParserSuite.scala --- @@ -33,7 +33,9 @@ class NumericParserSuite extends SparkFunSuite { malformatted.foreach { s => intercept[SparkException] { NumericParser.parse(s) + // scalastyle:off println println(s"Didn't detect malformatted string $s.") --- End diff -- Throw an exception even?
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-117828238 Merged build started.
[GitHub] spark pull request: [SPARK-5095][MESOS] Support capping cores and ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4027#discussion_r33728151 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala --- @@ -211,35 +226,43 @@ private[spark] class CoarseMesosSchedulerBackend( for (offer <- offers) { val slaveId = offer.getSlaveId.toString -val mem = getResource(offer.getResourcesList, "mem") -val cpus = getResource(offer.getResourcesList, "cpus").toInt -if (totalCoresAcquired < maxCores && -mem >= MemoryUtils.calculateTotalMemory(sc) && -cpus >= 1 +var remainingMem = getResource(offer.getResourcesList, "mem") +var remainingCores = getResource(offer.getResourcesList, "cpus").toInt + --- End diff -- can you nix these blank lines? L233 too
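For context, the diff above renames `mem`/`cpus` to `remainingMem`/`remainingCores` so an offer can be drawn down while a cluster-wide core cap is respected. A toy sketch of that greedy accounting follows; `Offer` and every constant here are simplified stand-ins for the Mesos protobuf types and SparkConf settings, not the PR's actual code:

```scala
// Toy model of capped core acquisition across Mesos resource offers.
case class Offer(slaveId: String, mem: Double, cpus: Int)

val maxCores = 8              // assumed cluster-wide cap on total cores
val memPerExecutor = 1024.0   // assumed memory needed per executor

var totalCoresAcquired = 0
val offers = Seq(Offer("s1", 4096.0, 6), Offer("s2", 4096.0, 6))

// For each offer, take as many cores as the cap still allows, provided the
// offer also has enough memory for an executor.
val launches = offers.flatMap { offer =>
  val coresToTake = math.min(offer.cpus, maxCores - totalCoresAcquired)
  if (coresToTake > 0 && offer.mem >= memPerExecutor) {
    totalCoresAcquired += coresToTake
    Some((offer.slaveId, coresToTake))
  } else {
    None
  }
}
// With the numbers above: 6 cores from s1, then only 2 from s2 (cap of 8).
```

The real backend additionally tracks per-slave failures and decrements the remaining resources as it launches multiple executors per offer; this sketch only shows the capping idea.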
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117833781 Merged build started.
[GitHub] spark pull request: [SPARK-1564] [Docs] Added Javascript to Javado...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7169#issuecomment-117833756 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33728907 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorBlacklistTracker.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.util.concurrent.TimeUnit + +import scala.collection.mutable + +import org.apache.spark._ +import org.apache.spark.util.{Clock, SystemClock, ThreadUtils, Utils} + +/** + * ExecutorBlacklistTracker blacklists the executors by tracking the status of running tasks with + * heuristic algorithm. + * + * A executor will be considered bad enough only when: + * 1. The failure task number on this executor is more than + * spark.scheduler.blacklist.executorFaultThreshold. + * 2. The failure task number on this executor is + * spark.scheduler.blacklist.averageBlacklistThreshold more than average failure task number + * of this cluster. 
 + * + * Also max number of blacklisted executors will not exceed the + * spark.scheduler.blacklist.maxBlacklistFraction of whole cluster, and blacklisted executors + * will be forgiven when there is no failure tasks in the + * spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes. + */ +private[spark] class ExecutorBlacklistTracker(conf: SparkConf) extends SparkListener { + import ExecutorBlacklistTracker._ + + private val maxBlacklistFraction = conf.getDouble( + "spark.scheduler.blacklist.maxBlacklistFraction", MAX_BLACKLIST_FRACTION) + private val avgBlacklistThreshold = conf.getDouble( + "spark.scheduler.blacklist.averageBlacklistThreshold", AVERAGE_BLACKLIST_THRESHOLD) + private val executorFaultThreshold = conf.getInt( + "spark.scheduler.blacklist.executorFaultThreshold", EXECUTOR_FAULT_THRESHOLD) + private val executorFaultTimeoutWindowInMinutes = conf.getInt( + "spark.scheduler.blacklist.executorFaultTimeoutWindowInMinutes", EXECUTOR_FAULT_TIMEOUT_WINDOW) + + // Count the number of executors registered + var numExecutorsRegistered: Int = 0 + + // Track the number of failure tasks and time of latest failure to executor id + val executorIdToTaskFailures = new mutable.HashMap[String, ExecutorFailureStatus]() + + // Clock used to update and exclude the executors which are out of time window. 
 + private var clock: Clock = new SystemClock() + + // Executor that handles the scheduling task + private val executor = ThreadUtils.newDaemonSingleThreadScheduledExecutor( + "spark-scheduler-blacklist-expire-timer") + + def start(): Unit = { + val scheduleTask = new Runnable() { + override def run(): Unit = { + Utils.logUncaughtExceptions(expireTimeoutExecutorBlacklist()) + } + } + executor.scheduleAtFixedRate(scheduleTask, 0L, 60, TimeUnit.SECONDS) + } + + def stop(): Unit = { + executor.shutdown() + executor.awaitTermination(10, TimeUnit.SECONDS) + } + + def setClock(newClock: Clock): Unit = { + clock = newClock + } + + def getExecutorBlacklist: Set[String] = synchronized { + executorIdToTaskFailures.filter(_._2.isBlackListed).keys.toSet + } + + override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized { + taskEnd.reason match { + case _: FetchFailed | _: ExceptionFailure | TaskResultLost | + _: ExecutorLostFailure | UnknownReason => + val failureStatus = executorIdToTaskFailures.getOrElseUpdate(taskEnd.taskInfo.executorId, + new ExecutorFailureStatus) + failureStatus.numFailures += 1 + failureStatus.updatedTime = clock.getTimeMillis() + + // Update the executor blacklist + updateExecutorBlacklist() + case _ => Unit + } + } + + override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit =
[GitHub] spark pull request: [SPARK-8425][Core][WIP] Add blacklist mechanis...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/6870#discussion_r33729777 --- Diff: core/src/test/scala/org/apache/spark/scheduler/ExecutorBlacklistTrackerSuite.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
 + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable + +import org.scalatest.{BeforeAndAfter, PrivateMethodTester} + +import org.apache.spark._ +import org.apache.spark.scheduler.ExecutorBlacklistTracker.ExecutorFailureStatus +import org.apache.spark.scheduler.cluster.ExecutorInfo +import org.apache.spark.util.ManualClock + +class ExecutorBlacklistTrackerSuite + extends SparkFunSuite + with LocalSparkContext + with BeforeAndAfter { + import ExecutorBlacklistTrackerSuite._ + + before { + if (sc == null) { + sc = createSparkContext + } + } + + after { + if (sc != null) { + sc.stop() + sc = null + } + } + + test("add executor to blacklist") { + // Add 5 executors + addExecutors(5) + val tracker = sc.executorBlacklistTracker.get + assert(numExecutorsRegistered(tracker) === 5) + + // Post 5 TaskEnd events to executor-1 to add executor-1 into the blacklist + (0 until 5).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-1")) + + assert(tracker.getExecutorBlacklist === Set("executor-1")) + assert(executorIdToTaskFailures(tracker)("executor-1").numFailures === 5) + assert(executorIdToTaskFailures(tracker)("executor-1").isBlackListed === true) + + // Post 10 TaskEnd events to executor-2 to add executor-2 into the blacklist + (0 until 10).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-2")) + assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2")) + assert(executorIdToTaskFailures(tracker)("executor-2").numFailures === 10) + assert(executorIdToTaskFailures(tracker)("executor-2").isBlackListed === true) + + // Post 5 TaskEnd events to executor-3 to verify whether executor-3 is blacklisted + (0 until 5).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-3")) + // Since the failure number of executor-3 is less than the average blacklist threshold, + // though it exceeds the fault threshold, it still should not be added into the blacklist + assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2")) + assert(executorIdToTaskFailures(tracker)("executor-3").numFailures === 5) 
 + assert(executorIdToTaskFailures(tracker)("executor-3").isBlackListed === false) + + // Keep posting TaskEnd events to executor-3 to add executor-3 into the blacklist + (0 until 2).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-3")) + assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2", "executor-3")) + assert(executorIdToTaskFailures(tracker)("executor-3").numFailures === 7) + assert(executorIdToTaskFailures(tracker)("executor-3").isBlackListed === true) + + // Post TaskEnd events to executor-4 to verify whether executor-4 could be added into the blacklist + (0 until 10).foreach(_ => postTaskEndEvent(TaskResultLost, "executor-4")) + // Even though executor-4's failure task number is above the blacklist threshold, + // the blacklisted executor number has reached the maximum fraction, + // so executor-4 still cannot be added into the blacklist. + assert(tracker.getExecutorBlacklist === Set("executor-1", "executor-2", "executor-3")) + assert(executorIdToTaskFailures(tracker)("executor-4").numFailures === 10) + assert(executorIdToTaskFailures(tracker)("executor-4").isBlackListed === false) + } + + test("remove executor from blacklist") { + // Add 5 executors + addExecutors(5) + val tracker = sc.executorBlacklistTracker.get + val clock = new ManualClock(1L) + tracker.setClock(clock) + assert(numExecutorsRegistered(tracker) === 5) +
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117837626 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117818193 [Test build #36299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36299/console) for PR 7165 at commit [`867754a`](https://github.com/apache/spark/commit/867754aa2b852e971c6d4359d67a469b1f709611). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8766] support non-ascii character in co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7165#issuecomment-117818218 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-6287][MESOS] Add dynamic allocation to ...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4984#discussion_r33723361 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -124,10 +124,16 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkCon (blockId, getFile(blockId)) } + /** + * Create local directories for storing block data. These directories are + * located inside configured local directories and won't + * be deleted on JVM exit when using the external shuffle service. --- End diff -- I just read it again. You mentioned that on Mesos the shuffle dir is not deleted, but its parent directory is. I'm confused about two things: First, if the parent directory is deleted, wouldn't everything it automatically be deleted as well? Second, I actually don't see how the parent directory (the middle layer in your comment) is deleted. Are you referring to `DiskBlockManager#doStop()`? AFAIK that only deletes the temporary directory created inside of the parent directory (i.e. the shuffle dir), and we don't do this if external shuffle service. I can't find the code that deletes the middle layer itself. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org