[GitHub] spark pull request: [SPARK-10048][SPARKR] Support arbitrary nested...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/8276#issuecomment-132297061 Thanks @sun-rui I'll take a look at this today cc @davies --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE] Disable spark.shuffle.reduceLocality.en...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132305977 Merged build started.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132307731 Cool - and once again, trying the new commit would be appreciated. Also @markgrover how do we want to resolve all of the duplicate work being done here and in #8093? Should we try to merge this commit first and have your commit be rebased on top of this? Or should it be the other way around?
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37336212 --- Diff: docs/ml-linear-methods.md --- @@ -118,12 +133,114 @@ lrModel = lr.fit(training) print("Weights: " + str(lrModel.weights)) print("Intercept: " + str(lrModel.intercept)) {% endhighlight %} +</div> </div> +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `Datafram`s in --- End diff -- Whoops, yep that's right
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/8184#discussion_r37337185 --- Diff: docs/ml-features.md --- @@ -649,6 +649,80 @@ for expanded in polyDF.select("polyFeatures").take(3): </div> </div> +## Discrete Cosine Transform (DCT) + +The [Discrete Cosine +Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform) +transforms a length $N$ real-valued sequence in the time domain into +another length $N$ real-valued sequence in the frequency domain. A +[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class +provides this functionality, implementing the +[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II) +and scaling the result by $1/\sqrt{2}$ such that the representing matrix +for the transform is unitary. No shift is applied to the transformed +sequence (e.g. the $0$th element of the transformed sequence is the +$0$th DCT coefficient and _not_ the $N/2$th). + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> +{% highlight scala %} +import org.apache.spark.ml.feature.DCT +import org.apache.spark.mllib.linalg.Vectors + +val data = Seq( + Vectors.dense(0.0, 1.0, -2.0, 3.0), + Vectors.dense(-1.0, 2.0, 4.0, -7.0), + Vectors.dense(14.0, -2.0, -5.0, 1.0)) +val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") +val dct = new DCT() + .setInputCol("features") + .setOutputCol("featuresDCT") + .setInverse(false) +val dctDf = dct.transform(df) +dctDf.select("featuresDCT").take(3).foreach(println) +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.ml.feature.DCT; +import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.VectorUDT; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +JavaRDD<Row> data = jsc.parallelize(Arrays.asList( + RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)), --- End diff -- 2 space indentation
[GitHub] spark pull request: [SPARK-9952] Fix N^2 loop when DAGScheduler.ge...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8178#issuecomment-132319392 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41138/ Test FAILed.
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132320481 cc @shivaram
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132320541 [Test build #41155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41155/consoleFull) for PR 8271 at commit [`cf7f509`](https://github.com/apache/spark/commit/cf7f5091174247ea945b8b4ae01eb02f64a07711).
[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8283#issuecomment-132320531 [Test build #41154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41154/consoleFull) for PR 8283 at commit [`be061b3`](https://github.com/apache/spark/commit/be061b3da928d645e2029ef37ac661a4cb84bb24).
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132324006 The reduce stage has a 2-way join in it. The two map stages had 30 and 1 tasks, respectively. For the stage having 30 tasks, here is the screenshot of task info ![image](https://cloud.githubusercontent.com/assets/2072857/9340299/c3c52c7a-45a3-11e5-8ee8-425fcd44612c.png) For the stage having 1 task, here is the screenshot of task info ![image](https://cloud.githubusercontent.com/assets/2072857/9340324/e332f010-45a3-11e5-97e9-40adb5461975.png)
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132328891 ah, sorry i missed the reducer stage's screenshot. Yes, executor 23 was the one that got all the reduce tasks. ![image](https://cloud.githubusercontent.com/assets/2072857/9340710/48fbb2a4-45a6-11e5-87fc-41d8b41ef6a6.png)
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132331662 So my hypothesis right now is that the RDD in the reduce stage has two shuffle dependencies, and the first shuffle dependency happens to be the single-map-task stage -- so the locality preference ends up giving all the tasks to the single host. My guess is that ideally we need to be able to differentiate among different shuffle dependencies. Here is another suggestion: can we turn this off if we have more than one shuffle dependency? It should be pretty cheap to count that.
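To make the failure mode above concrete, here is a simplified, hypothetical model of a reduce-locality preference -- not Spark's actual `DAGScheduler` code, and the class name and 0.8 threshold are illustrative assumptions. A reducer prefers a host only when that host holds a large fraction of a dependency's map output, so a single-task map stage (all output on one host) makes every reduce task prefer that one host:

```java
import java.util.Map;
import java.util.Optional;

public class ReduceLocalityModel {
    // Hypothetical heuristic: prefer a host only if it holds at least
    // `threshold` of the total map output bytes for one shuffle dependency.
    static Optional<String> preferredHost(Map<String, Long> bytesByHost, double threshold) {
        long total = bytesByHost.values().stream().mapToLong(Long::longValue).sum();
        return bytesByHost.entrySet().stream()
                .filter(e -> (double) e.getValue() / total >= threshold)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // A 30-task map stage spreads its output: no host passes the threshold,
        // so reduce tasks have no locality preference and spread out.
        Map<String, Long> wideStage = Map.of("host1", 10L, "host2", 11L, "host3", 9L);
        // A 1-task map stage puts everything on one host, so every reduce task
        // that consults this dependency prefers that single host.
        Map<String, Long> singleTaskStage = Map.of("host1", 5L);
        System.out.println(preferredHost(wideStage, 0.8));       // Optional.empty
        System.out.println(preferredHost(singleTaskStage, 0.8)); // Optional[host1]
    }
}
```

Under this model, counting the shuffle dependencies (and skipping the preference when there is more than one) would avoid letting a trivially small dependency dominate placement, which is the cheap check suggested above.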
[GitHub] spark pull request: [SPARK-9439] [yarn] External shuffle service r...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/7943#issuecomment-132333169 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-10085] [MLlib] [Docs] removed unnecessa...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/8284#issuecomment-132333113 LGTM. Merged into master and branch-1.5. Thanks!
[GitHub] spark pull request: [SPARK-10085] [MLlib] [Docs] removed unnecessa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8284
[GitHub] spark pull request: [SPARK-10085] [MLlib] [Docs] removed unnecessa...
Github user stared commented on the pull request: https://github.com/apache/spark/pull/8284#issuecomment-132333595 Wow, it was quick! Thanks!
[GitHub] spark pull request: [SPARK-9893] user guide for VectorSlicer
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8267#discussion_r37346802 --- Diff: docs/ml-features.md --- @@ -1389,3 +1389,145 @@ print(output.select("features", "clicked").first()) # Feature Selectors +## VectorSlicer + +`VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column. + +`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column whose values are selected via those indices. There are two types of indices, + + 1. Integer indices that represents the real indices in the vector, `setIndices()`; + + 2. String indices that represents the names of features in the vector, `setNames()`. + +Specify by integer and string are both acceptable, moreover, you can use integer index and string name simultaneously. At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names. Note that if names of features are selected, an exception will be threw out when encountering with empty input attributes. --- End diff -- nit: ***Specification*** by integer and string are both acceptable***. M***oreover,
[GitHub] spark pull request: [SPARK-9439] [yarn] External shuffle service r...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7943#issuecomment-132335105 [Test build #41158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41158/consoleFull) for PR 7943 at commit [`0d285d3`](https://github.com/apache/spark/commit/0d285d3fac15afc77313255799a3392dcf74518f).
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8184#discussion_r37349464 --- Diff: docs/ml-features.md --- @@ -649,6 +649,80 @@ for expanded in polyDF.select("polyFeatures").take(3): </div> </div> +## Discrete Cosine Transform (DCT) + +The [Discrete Cosine +Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform) +transforms a length $N$ real-valued sequence in the time domain into +another length $N$ real-valued sequence in the frequency domain. A +[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class +provides this functionality, implementing the +[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II) +and scaling the result by $1/\sqrt{2}$ such that the representing matrix +for the transform is unitary. No shift is applied to the transformed +sequence (e.g. the $0$th element of the transformed sequence is the +$0$th DCT coefficient and _not_ the $N/2$th). + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> +{% highlight scala %} +import org.apache.spark.ml.feature.DCT +import org.apache.spark.mllib.linalg.Vectors + +val data = Seq( + Vectors.dense(0.0, 1.0, -2.0, 3.0), + Vectors.dense(-1.0, 2.0, 4.0, -7.0), + Vectors.dense(14.0, -2.0, -5.0, 1.0)) +val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") +val dct = new DCT() + .setInputCol("features") + .setOutputCol("featuresDCT") + .setInverse(false) +val dctDf = dct.transform(df) +dctDf.select("featuresDCT").take(3).foreach(println) +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.ml.feature.DCT; +import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.VectorUDT; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +JavaRDD<Row> data = jsc.parallelize(Arrays.asList( + RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)), --- End diff -- OK
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8197#issuecomment-132343523 Merged build triggered.
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132346850 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41149/ Test FAILed.
[GitHub] spark pull request: [SPARK-10088] [sql] Add support for stored as...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8282#issuecomment-132354966 [Test build #41153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41153/console) for PR 8282 at commit [`2256b43`](https://github.com/apache/spark/commit/2256b430bd7e98ca0bc92dc74bdf7340f9d134cf). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37358327 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala --- @@ -91,6 +92,66 @@ private[spark] abstract class YarnSchedulerBackend( } /** + * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected. + * We should check the cluster manager and find if the loss of the executor was caused by YARN + * force killing it due to preemption. + */ + private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)]) + extends DriverEndpoint(rpcEnv, sparkProperties) { + +private val pendingDisconnectedExecutors = new HashSet[String] +private val handleDisconnectedExecutorThreadPool = + ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool") +/** + * When onDisconnected is received at the driver endpoint, the superclass DriverEndpoint + * handles it by assuming the Executor was lost for a bad reason and removes the executor + * immediately. + * + * In YARN's case however it is crucial to talk to the application master and ask why the + * executor had exited. In particular, the executor may have exited due to the executor + * having been preempted. If the executor exited normally according to the application + * master then we pass that information down to the TaskSetManager to inform the + * TaskSetManager that tasks on that lost executor should not count towards a job failure. + */ +override def onDisconnected(rpcAddress: RpcAddress): Unit = { --- End diff -- To @markgrover's question, yes, by overriding the method then only this implementation will be invoked.
[GitHub] spark pull request: [SPARK-8918] [MLLIB] [DOC] Add @since tags to ...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/8288 [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering This continues the work from #8256. I removed `@since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659). @MechCoder Closes #8256 You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark SPARK-8918 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8288.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8288 commit 6bcb09ba21bf5af6a5662e719abd0eb488c3a6c1 Author: Xiaoqing Wang spark...@126.com Date: 2015-08-17T23:02:26Z SPARK-8918 Add @since tags to mllib.clustering commit c679b975cf2fe5d463bee4ca665dced7504af107 Author: Xiaoqing Wang spark...@126.com Date: 2015-08-18T13:55:58Z update code style, tab replaced by blank commit e430de9fa576ae8752f7d90b342026ff84fab8b5 Author: Xiaoqing Wang spark...@126.com Date: 2015-08-18T14:45:41Z update code style: delete the whitespace at end of line commit e94968ad540c3e7cd15d7e2015d6705503e459e1 Author: Xiangrui Meng m...@databricks.com Date: 2015-08-18T22:00:54Z Merge remote-tracking branch 'apache/master' into XiaoqingWang-SPARK-8918 commit 72fdeb64630470f6f46cf3eed8ffbfe83a7c4659 Author: Xiangrui Meng m...@databricks.com Date: 2015-08-18T22:05:01Z remove since tags from private vars
[GitHub] spark pull request: [SPARK-8918] [MLLIB] [DOC] Add @since tags to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8288#issuecomment-132371671 Merged build triggered.
[GitHub] spark pull request: [SPARK-10090] [SQL] fix decimal scale of divis...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/8287 [SPARK-10090] [SQL] fix decimal scale of division In TPCDS Q59, the result should be DecimalType(37, 20), but got Decimal('0.69903637110664268591656984574863203607'), should be Decimal('0.69903637110664268592'). TODO: add regression tests (we have low coverage for DecimalType in Cast) You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark decimal_division Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8287.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8287 commit 3d5f0911d6df431273fd2d3753df9c4241a9107c Author: Davies Liu dav...@databricks.com Date: 2015-08-18T22:09:52Z fix decimal precision of division
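The scale mismatch described above can be sketched with plain `java.math.BigDecimal`; the operands below are made up for illustration and are not the actual TPCDS Q59 values. A division carried out at high internal precision keeps extra fractional digits, and a result destined for a `DecimalType(37, 20)` column must then be rounded to scale 20 rather than returned verbatim:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalScaleDemo {
    public static void main(String[] args) {
        // Illustrative operands (not the actual Q59 data).
        BigDecimal numerator = new BigDecimal("7");
        BigDecimal denominator = new BigDecimal("9");

        // Dividing at a high internal scale keeps 38 fractional digits...
        BigDecimal raw = numerator.divide(denominator, 38, RoundingMode.HALF_UP);
        // ...but a DecimalType(37, 20) result must be rounded to scale 20.
        BigDecimal fitted = raw.setScale(20, RoundingMode.HALF_UP);

        System.out.println(raw);    // 0.77777777777777777777777777777777777778
        System.out.println(fitted); // 0.77777777777777777778
    }
}
```

Returning `raw` instead of `fitted` is exactly the shape of the bug in the PR description: a 38-digit fraction where a 20-digit one was declared.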
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37345734 --- Diff: docs/ml-linear-methods.md --- @@ -118,12 +133,114 @@ lrModel = lr.fit(training) print(Weights: + str(lrModel.weights)) print(Intercept: + str(lrModel.intercept)) {% endhighlight %} +/div /div +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `Datafram`s in +`BinaryLogisticRegressionSummary` are annoted `@transient` and hence +only available on the driver. + +div class=codetabs + +div data-lang=scala markdown=1 + +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary) +provides a summary for a +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% highlight scala %} +// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example +val trainingSummary = lrModel.summary + +// Obtain the loss per iteration. +val objectiveHistory = trainingSummary.objectiveHistory +objectiveHistory.foreach(loss = println(loss)) + +// Obtain the metrics useful to judge performance on test data. +// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a +// binary classification problem. +val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary] + +// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC. 
+val roc = binarySummary.roc +roc.show() +roc.select(FPR).show() +println(binarySummary.areaUnderROC) + +// Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression with +// this selected threshold. +val fMeasure = binarySummary.fMeasureByThreshold +val maxFMeasure = fMeasure.select(max(F-Measure)).head().getDouble(0) +val bestThreshold = fMeasure.where($F-Measure === maxFMeasure). + select(threshold).head().getDouble(0) +logReg.setThreshold(bestThreshold) +logReg.fit(logRegDataFrame) +{% endhighlight %} /div -### Optimization +div data-lang=java markdown=1 +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html) +provides a summary for a +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% highlight java %} +// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example +LogisticRegressionTrainingSummary trainingSummary = logRegModel.summary(); + +// Obtain the loss per iteration. +double[] objectiveHistory = trainingSummary.objectiveHistory(); +for (double lossPerIteration : objectiveHistory) { --- End diff -- Nope, see [Google's Java Style Guide](https://google.github.io/styleguide/javaguide.html#s4.6.2-horizontal-whitespace) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
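The threshold-selection step in the quoted Scala snippet is just an argmax over (threshold, F-measure) rows. A self-contained sketch of that logic in plain Scala (the sample rows below are invented for illustration; in the guide the rows come from `binarySummary.fMeasureByThreshold`, a DataFrame computed over the training set):

```scala
object ThresholdSelection {
  // Given (threshold, F-measure) pairs, return the threshold whose
  // F-measure is maximal -- the same argmax the guide performs with
  // max("F-Measure") followed by a where(...) filter.
  def bestThreshold(fMeasureByThreshold: Seq[(Double, Double)]): Double =
    fMeasureByThreshold.maxBy(_._2)._1

  // Invented sample rows, for illustration only.
  val sampleRows = Seq((0.1, 0.62), (0.3, 0.81), (0.5, 0.77), (0.7, 0.54))
}
```

Doing the argmax in one pass with `maxBy` also avoids the two-step max-then-filter dance, though on a distributed DataFrame the two-step form is what the API offers.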
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37345587 --- Diff: docs/ml-linear-methods.md --- @@ -118,12 +133,114 @@ lrModel = lr.fit(training) print("Weights: " + str(lrModel.weights)) print("Intercept: " + str(lrModel.intercept)) {% endhighlight %} +</div> </div> +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `DataFrame`s in +`BinaryLogisticRegressionSummary` are annotated `@transient` and hence +only available on the driver. + +<div class="codetabs"> + +<div data-lang="scala" markdown="1"> + +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary) +provides a summary for a +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary). +This will likely change when multiclass classification is supported. --- End diff -- Downcasting is almost always an indication of a poor abstraction and IMO the stabilized API should not require any explicit typecasting by the end user, [here's an explanation](http://codebetter.com/jeremymiller/2006/12/26/downcasting-is-a-code-smell/)
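The difference the reviewer is pointing at can be seen with toy classes (illustrative stand-ins, not Spark's actual API): an `asInstanceOf` downcast throws `ClassCastException` if the runtime type ever changes, whereas a pattern match forces the caller to handle the non-binary case explicitly.

```scala
// Toy stand-ins for a summary hierarchy, to contrast cast vs. pattern match.
trait Summary
class BinarySummary extends Summary {
  def areaUnderROC: Double = 0.91 // placeholder metric value
}

object CastDemo {
  val summary: Summary = new BinarySummary

  // Style the guide documents: an unchecked downcast.
  val viaCast: Double = summary.asInstanceOf[BinarySummary].areaUnderROC

  // Pattern match: the "not binary" case is represented as None
  // instead of a runtime exception.
  val viaMatch: Option[Double] = summary match {
    case b: BinarySummary => Some(b.areaUnderROC)
    case _                => None
  }
}
```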
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132332983 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41155/ Test PASSed.
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132332979 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10001] [CORE] Allow Ctrl-C in spark-she...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8216#issuecomment-132339281 [Test build #41148 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41148/console) for PR 8216 at commit [`d3eabf0`](https://github.com/apache/spark/commit/d3eabf026fc4806414131833435e1fd0e868957a). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9772] [PySpark] [ML] Add Python API for...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8102#issuecomment-132345003 One more comment: need to add VectorSlicer to list ```__all__``` at top of file
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132346765 [Test build #41149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41149/console) for PR 8280 at commit [`f77e574`](https://github.com/apache/spark/commit/f77e574dd749c0c140ee71e4aaa143abbfcc6d56). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8184#issuecomment-132346686 [Test build #41163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41163/consoleFull) for PR 8184 at commit [`dfb3b2f`](https://github.com/apache/spark/commit/dfb3b2ffe8928142d8e1e96c9a45968056d2336d).
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132348290 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132348292 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41152/ Test FAILed.
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8197#issuecomment-132348205 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41162/ Test PASSed.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132348217 [Test build #41152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41152/console) for PR 8007 at commit [`70d6a15`](https://github.com/apache/spark/commit/70d6a1587906210ea4451fb1743b8eda6e7b90c4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class ExecutorNormalExit(` * `class ExecutorLossReason(val message: String) extends Serializable ` * `case class ExecutorExitedNormally(val exitCode: Int, reason: String)` * ` case class RemoveExecutor(executorId: String, reason: ExecutorLossReason)` * ` case class AcknowledgeExecutorRemoved(executorId: String) extends CoarseGrainedClusterMessage` * ` case class GetExecutorLossReason(executorId: String) extends CoarseGrainedClusterMessage`
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8197#issuecomment-132348201 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/8286#issuecomment-132350637 +1
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37353473 --- Diff: core/src/main/scala/org/apache/spark/TaskEndReason.scala --- @@ -208,6 +208,22 @@ case class ExecutorLostFailure(execId: String) extends TaskFailedReason { /** * :: DeveloperApi :: + * The task failed because the executor that it was running on was prematurely terminated. The + * executor is forcibly exited but the exit should be considered as part of normal cluster + * behavior. + */ +@DeveloperApi +case class ExecutorNormalExit( --- End diff -- I'd give the same feedback here, but then `ExecutorLostFailure` is a developer API... still I think that a single reason (with a boolean saying whether to treat it as an error) would be simpler.
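A sketch of the single-reason shape suggested in this comment — one case class carrying a boolean instead of a separate subclass per exit kind. The names and field choices here are illustrative, not Spark's actual API:

```scala
// Hypothetical single-reason design: the flag says whether the exit
// should count against the application's failure limits, so schedulers
// branch on a field rather than on the runtime subclass.
case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean, reason: String)

object ExitHandling {
  def shouldCountAsFailure(e: ExecutorExited): Boolean = e.exitCausedByApp

  // Invented examples: a YARN preemption is "normal", a JVM crash is not.
  val preempted = ExecutorExited(1, exitCausedByApp = false, "container preempted by YARN")
  val crashed   = ExecutorExited(134, exitCausedByApp = true, "JVM crashed")
}
```

The trade-off versus the PR's two-class approach is that pattern matches stay exhaustive over a single type, at the cost of losing a distinct type per exit kind.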
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8184#issuecomment-132350414 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41163/ Test PASSed.
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8286#issuecomment-132351145 [Test build #41164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41164/consoleFull) for PR 8286 at commit [`e547fe8`](https://github.com/apache/spark/commit/e547fe80f59f83fe2b3934215975f9180c5da164).
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132353692 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132353694 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41159/ Test PASSed.
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132353469 [Test build #41159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41159/console) for PR 8271 at commit [`a92c382`](https://github.com/apache/spark/commit/a92c38287c273d82d3e22cf35ebc8216f33d0b2d). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...
Github user sabhyankar commented on the pull request: https://github.com/apache/spark/pull/8241#issuecomment-132362277 Thanks for pointing that out @holdenk ! I have pushed a change to the PR!
[GitHub] spark pull request: [SPARK-10060] [ML] [DOC] spark.ml DecisionTree...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8244#discussion_r37358090 --- Diff: docs/ml-decision-tree.md --- @@ -0,0 +1,506 @@ +--- +layout: global +title: Decision Trees - SparkML +displayTitle: <a href="ml-guide.html">ML</a> - Decision Trees +--- + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + + +# Overview + +[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning) +and their ensembles are popular methods for the machine learning tasks of +classification and regression. Decision trees are widely used since they are easy to interpret, +handle categorical features, extend to the multiclass classification setting, do not require +feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble +algorithms such as random forests and boosting are among the top performers for classification and +regression tasks. + +MLlib supports decision trees for binary and multiclass classification and for regression, +using both continuous and categorical features. The implementation partitions data by rows, +allowing distributed training with millions or even billions of instances. + +Users can find more information about the decision tree algorithm in the [MLlib Decision Tree guide](mllib-decision-tree.html). In this section, we demonstrate the Pipelines API for Decision Trees. + +The Pipelines API for Decision Trees offers a bit more functionality than the original API. In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities). + +Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](ml-ensembles.html). + +# Inputs and Outputs (Predictions) + +We list the input and output (prediction) column types here. +All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.
+ +## Input Columns + +<table class="table"> + <thead> +<tr> + <th align="left">Param name</th> + <th align="left">Type(s)</th> + <th align="left">Default</th> + <th align="left">Description</th> +</tr> + </thead> + <tbody> +<tr> + <td>labelCol</td> + <td>Double</td> + <td>"label"</td> + <td>Label to predict</td> +</tr> +<tr> + <td>featuresCol</td> + <td>Vector</td> + <td>"features"</td> + <td>Feature vector</td> +</tr> + </tbody> +</table> + +## Output Columns + +<table class="table"> + <thead> +<tr> + <th align="left">Param name</th> + <th align="left">Type(s)</th> + <th align="left">Default</th> + <th align="left">Description</th> + <th align="left">Notes</th> +</tr> + </thead> + <tbody> +<tr> + <td>predictionCol</td> + <td>Double</td> + <td>"prediction"</td> + <td>Predicted label</td> + <td></td> +</tr> +<tr> + <td>rawPredictionCol</td> + <td>Vector</td> + <td>"rawPrediction"</td> + <td>Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction</td> + <td>Classification only</td> +</tr> +<tr> + <td>probabilityCol</td> + <td>Vector</td> + <td>"probability"</td> + <td>Vector of length # classes equal to rawPrediction normalized to a multinomial distribution</td> + <td>Classification only</td> +</tr> + </tbody> +</table> + +# Examples + +The below examples demonstrate the Pipelines API for Decision Trees. The main differences between this API and the [original MLlib Decision Tree API](mllib-decision-tree.html) are: + +* support for ML Pipelines +* separation of Decision Trees for classification vs. regression +* use of DataFrame metadata to distinguish continuous and categorical features + + +## Classification + +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set. +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
+ +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier). + +{% highlight scala %} +import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.classification.DecisionTreeClassifier +import
[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8053#issuecomment-132367967 [Test build #41157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41157/console) for PR 8053 at commit [`5be3e44`](https://github.com/apache/spark/commit/5be3e449aa0306c41398408157a7db1cd94f1aa8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8053#issuecomment-132368084 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41157/ Test PASSed.
[GitHub] spark pull request: [SPARK-9439] [yarn] External shuffle service r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7943#issuecomment-132333862 Merged build started.
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37347155 --- Diff: docs/ml-linear-methods.md --- @@ -118,12 +133,114 @@ lrModel = lr.fit(training) print("Weights: " + str(lrModel.weights)) print("Intercept: " + str(lrModel.intercept)) {% endhighlight %} +</div> </div> +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `DataFrame`s in +`BinaryLogisticRegressionSummary` are annotated `@transient` and hence +only available on the driver. + +<div class="codetabs"> + +<div data-lang="scala" markdown="1"> + +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary) +provides a summary for a +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% highlight scala %} +// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example +val trainingSummary = lrModel.summary + +// Obtain the loss per iteration. +val objectiveHistory = trainingSummary.objectiveHistory +objectiveHistory.foreach(loss => println(loss)) + +// Obtain the metrics useful to judge performance on test data. +// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a +// binary classification problem. +val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary] + +// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
+val roc = binarySummary.roc +roc.show() +roc.select("FPR").show() +println(binarySummary.areaUnderROC) + +// Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression with +// this selected threshold. +val fMeasure = binarySummary.fMeasureByThreshold +val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0) +val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure). + select("threshold").head().getDouble(0) +logReg.setThreshold(bestThreshold) +logReg.fit(logRegDataFrame) +{% endhighlight %} </div> -### Optimization +<div data-lang="java" markdown="1"> +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html) +provides a summary for a +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% highlight java %} +// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example
LogisticRegressionTrainingSummary trainingSummary = logRegModel.summary(); + +// Obtain the loss per iteration. +double[] objectiveHistory = trainingSummary.objectiveHistory(); +for (double lossPerIteration : objectiveHistory) { --- End diff -- I see then other places such as this (https://github.com/apache/spark/blob/master/mllib/src/test/java/org/apache/spark/ml/clustering/JavaKMeansSuite.java#L68) have to be changed.
[GitHub] spark pull request: [SPARK-10070] [DOCS] Remove Guava dependencies...
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/8272#issuecomment-132335749 LGTM CC @mengxr
[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8285#issuecomment-132340941 Merged build triggered.
[GitHub] spark pull request: [SPARK-2788] [STREAMING] Add location filterin...
Github user dmvieira commented on the pull request: https://github.com/apache/spark/pull/1717#issuecomment-132340568 I'm starting a third-party package as suggested by @srowen and I hope you enjoy. Feel free to collaborate: https://github.com/dmvieira/spark-twitter-stream-receiver
[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8285#issuecomment-132344077 [Test build #41161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41161/consoleFull) for PR 8285 at commit [`e8b8240`](https://github.com/apache/spark/commit/e8b8240d389782bfc0e75cbe1797ce5aecc47092).
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132344119 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41150/ Test PASSed.
[GitHub] spark pull request: [SPARK-8505][SparkR] Add settings to kick `lin...
Github user shaneknapp commented on the pull request: https://github.com/apache/spark/pull/7883#issuecomment-132343883 found one directory on amp-jenkins-worker-01 that's polluted -- deleting it now, and this should fix any builds that run there. On Mon, Aug 17, 2015 at 9:36 PM, shane knapp ☠incompl...@gmail.com wrote: On Mon, Aug 17, 2015 at 10:11 AM, Shivaram Venkataraman notificati...@github.com wrote: @JoshRosen https://github.com/JoshRosen There seems to be some problem on some of the Jenkins workers and we get errors which look like running git clean -fdx warning: failed to remove 'target/' Removing target/ Build step 'Execute shell' marked build as failure I've seen this in other PRs as well -- Any ideas what is causing this ? somehow the spark builds are creating directories w/the wrong permissions (missing the owner write bit), meaning that the directory created from a previous build can't be deleted and thereby fails the build. i'll go through all of the workers/spark build dirs first thing tomorrow and fix this.
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132343931 [Test build #41150 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41150/console) for PR 8281 at commit [`541d9a0`](https://github.com/apache/spark/commit/541d9a016b125a3fbbef5cdf97ee3ff9db78b8a0). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `implicit class StringToColumn(val sc: StringContext)`
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132345128 The diff I'm proposing is something like

```
+    val numShuffleDeps = rdd.dependencies.filter(_.isInstanceOf[ShuffleDependency[_, _, _]]).length
+
     // If the RDD has shuffle dependencies and shuffle locality is enabled, pick locations that
     // have at least REDUCER_PREF_LOCS_FRACTION of data as preferred locations
-    if (shuffleLocalityEnabled && rdd.partitions.length < SHUFFLE_PREF_REDUCE_THRESHOLD) {
+    if (numShuffleDeps == 1 && shuffleLocalityEnabled &&
+        rdd.partitions.length < SHUFFLE_PREF_REDUCE_THRESHOLD) {
```
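[Editor's note] A stand-alone sketch of the dependency-counting idiom in the proposed diff, for readers outside the Spark codebase. `Dependency` and `ShuffleDependency` here are simplified placeholders, not the real `org.apache.spark` types, and the guard variable names are illustrative:

```scala
// Placeholder dependency hierarchy standing in for Spark's Dependency types.
abstract class Dependency
class NarrowDependency extends Dependency
class ShuffleDependency extends Dependency

val dependencies: Seq[Dependency] =
  Seq(new NarrowDependency, new ShuffleDependency, new NarrowDependency)

// Count the shuffle dependencies; the proposed guard applies the reduce-side
// locality preference only when there is exactly one.
val numShuffleDeps = dependencies.count(_.isInstanceOf[ShuffleDependency])
val useShuffleLocality = numShuffleDeps == 1
println(s"numShuffleDeps=$numShuffleDeps useShuffleLocality=$useShuffleLocality")
```

`count(_.isInstanceOf[...])` is equivalent to the diff's `filter(...).length` but avoids building the intermediate collection.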
[GitHub] spark pull request: [SPARK-9833] [yarn] Add options to disable del...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/8134#issuecomment-132347459 @tgravescs I chose a slightly different name than you suggested, how does that sound?
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8286#issuecomment-132349838 Merged build started.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37353128

```
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorLossReason.scala ---
@@ -23,13 +23,29 @@ import org.apache.spark.executor.ExecutorExitCode
  * Represents an explanation for an executor or whole slave failing or exiting.
  */
 private[spark]
-class ExecutorLossReason(val message: String) {
+class ExecutorLossReason(val message: String) extends Serializable {
   override def toString: String = message
 }

+private[spark] case class ExecutorExitedAbnormally(val exitCode: Int, reason: String)
+  extends ExecutorLossReason(reason) {
+}
+
+private[spark] object ExecutorExitedAbnormally {
+  def apply(exitCode: Int): ExecutorExitedAbnormally = {
+    ExecutorExitedAbnormally(exitCode, ExecutorExitCode.explainExitCode(exitCode))
+  }
+}
+
 private[spark]
-case class ExecutorExited(val exitCode: Int)
-  extends ExecutorLossReason(ExecutorExitCode.explainExitCode(exitCode)) {
+case class ExecutorExitedNormally(val exitCode: Int, reason: String)
+  extends ExecutorLossReason(reason) {
+}
+
+private[spark] object ExecutorExitedNormally {
```

--- End diff --

I don't know, I find `ExecutorExitedAbnormally` and `ExecutorExitedNormally` a little confusing, since internally they hold exactly the same data (even the same reason message). What if there was only `ExecutorExited` with a parameter saying whether it should be treated as an error or not?
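[Editor's note] A hedged sketch of the single-class alternative the reviewer suggests: one `ExecutorExited` case class carrying a normal-exit flag instead of two near-identical variants. The flag name and the `explainExitCode` stand-in are illustrative, not necessarily what was merged into Spark:

```scala
class ExecutorLossReason(val message: String) extends Serializable {
  override def toString: String = message
}

// One class for both cases; the flag says whether the exit counts as an error.
case class ExecutorExited(exitCode: Int, isNormalExit: Boolean, reason: String)
  extends ExecutorLossReason(reason)

object ExecutorExited {
  // Placeholder for Spark's ExecutorExitCode.explainExitCode(exitCode).
  private def explainExitCode(code: Int): String = s"executor exited with code $code"

  def apply(exitCode: Int, isNormalExit: Boolean): ExecutorExited =
    ExecutorExited(exitCode, isNormalExit, explainExitCode(exitCode))
}

val preempted = ExecutorExited(137, isNormalExit = true)
println(preempted)
```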
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8286#issuecomment-132349770 Merged build triggered.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37353932

```
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -809,9 +825,14 @@ private[spark] class TaskSetManager(
       }
     }
   }
-    // Also re-enqueue any tasks that were running on the node
     for ((tid, info) <- taskInfos if info.running && info.executorId == execId) {
-      handleFailedTask(tid, TaskState.FAILED, ExecutorLostFailure(execId))
+    // Also re-enqueue any tasks that were running on the node
+    val executorFailureReason = reason match {
+      case exited: ExecutorExitedNormally =>
```

--- End diff --

This would go away if you follow my suggestion of merging the two errors, but in any case, since it's a case class:

```
case ExecutorExitedNormally(exitCode, reason) =>
```
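[Editor's note] A minimal illustration of the destructuring pattern the reviewer suggests: a case class pattern binds the fields directly, avoiding the typed binder plus field access. The types here are simplified stand-ins for the Spark classes under review:

```scala
case class ExecutorExitedNormally(exitCode: Int, reason: String)

def describe(loss: Any): String = loss match {
  // Instead of `case exited: ExecutorExitedNormally => ... exited.reason`,
  // bind exitCode and reason directly in the pattern.
  case ExecutorExitedNormally(exitCode, reason) =>
    s"normal exit ($exitCode): $reason"
  case other =>
    s"unexpected loss reason: $other"
}

println(describe(ExecutorExitedNormally(0, "preempted by YARN")))
```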
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37355106

```
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
@@ -91,6 +92,68 @@ private[spark] abstract class YarnSchedulerBackend(
   }

   /**
+   * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected.
+   * We should check the cluster manager and find if the loss of the executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)])
+      extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+    private val pendingDisconnectedExecutors = new HashSet[String]
+    private val handleDisconnectedExecutorThreadPool =
+      ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+    /**
+     * When onDisconnected is received at the driver endpoint, the superclass DriverEndpoint
+     * handles it by assuming the Executor was lost for a bad reason and removes the executor
+     * immediately.
+     *
+     * In YARN's case however it is crucial to talk to the application master and ask why the
+     * executor had exited. In particular, the executor may have exited due to the executor
+     * having been preempted. If the executor exited normally according to the application
+     * master then we pass that information down to the TaskSetManager to inform the
+     * TaskSetManager that tasks on that lost executor should not count towards a job failure.
+     */
+    override def onDisconnected(rpcAddress: RpcAddress): Unit = {
+      addressToExecutorId.get(rpcAddress).foreach({ executorId =>
+        // onDisconnected could be fired multiple times from the same executor while we're
+        // asynchronously contacting the AM. So keep track of the executors we're trying to
+        // find loss reasons for and don't duplicate the work
+        if (!pendingDisconnectedExecutors.contains(executorId)) {
+          pendingDisconnectedExecutors.add(executorId)
+          handleDisconnectedExecutorThreadPool.submit(new Runnable() {
+            override def run(): Unit = {
+              val executorLossReason =
+                // Check for the loss reason and pass the loss reason to driverEndpoint
+                yarnSchedulerEndpoint.askWithRetry[Option[ExecutorLossReason]](
+                  GetExecutorLossReason(executorId))
+              executorLossReason match {
+                case Some(reason) =>
+                  driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, reason))
+                case None =>
+                  logWarning(s"Attempted to get executor loss reason" +
+                    s" for $rpcAddress but got no response. Marking as slave lost.")
+                  driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, SlaveLost()))
```

--- End diff --

Can you call `super.removeExecutor()` directly here, instead of doing the round-trip through the RPC layer? (Might need to check whether that method is thread-safe.)
[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8257#issuecomment-132355177 [Test build #1651 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1651/console) for PR 8257 at commit [`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10088] [sql] Add support for stored as...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8282#issuecomment-132355127 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10088] [sql] Add support for stored as...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8282#issuecomment-132355128 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41153/ Test PASSed.
[GitHub] spark pull request: [SPARK-9782] [YARN] Support YARN application t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8072
[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8283
[GitHub] spark pull request: [SPARK-10088] [sql] Add support for stored as...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/8282#issuecomment-132364808 Thanks! Merging to master and 1.5.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37357691

```
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
@@ -91,6 +92,66 @@ private[spark] abstract class YarnSchedulerBackend(
   }

   /**
+   * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected.
+   * We should check the cluster manager and find if the loss of the executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)])
+      extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+    private val pendingDisconnectedExecutors = new HashSet[String]
+    private val handleDisconnectedExecutorThreadPool =
+      ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+    /**
+     * When onDisconnected is received at the driver endpoint, the superclass DriverEndpoint
+     * handles it by assuming the Executor was lost for a bad reason and removes the executor
+     * immediately.
+     *
+     * In YARN's case however it is crucial to talk to the application master and ask why the
+     * executor had exited. In particular, the executor may have exited due to the executor
+     * having been preempted. If the executor exited normally according to the application
+     * master then we pass that information down to the TaskSetManager to inform the
+     * TaskSetManager that tasks on that lost executor should not count towards a job failure.
+     */
+    override def onDisconnected(rpcAddress: RpcAddress): Unit = {
```

--- End diff --

I guess there will be wasted work in the sense that the tasks will get allocated to the bad executor, then the executor will be removed, and all of those tasks are relocated to the healthy ones. That's probably fine from a correctness standpoint but might add a bit of latency. I'm open to the discussion of doing another architecture overhaul to get the soft-unregistration construct done. The other thing I'm wondering is whether it's even worth offloading this communicate-with-AM logic to be asynchronous at all. How big a performance penalty would it be to block the event loop with the request to the AM for the executor loss reason? I presumed it was unacceptable to do that blocking request on the main event loop, though.
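[Editor's note] A hedged sketch of the async hand-off being debated here: the blocking "ask the AM" call is pushed onto a thread pool so the RPC event loop is not blocked, with a concurrent set preventing duplicate in-flight lookups per executor. `java.util.concurrent` stand-ins replace Spark's `ThreadUtils`/`askWithRetry` internals, and `fetchLossReason` is a placeholder:

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

val pendingDisconnectedExecutors = ConcurrentHashMap.newKeySet[String]()
val pool = Executors.newCachedThreadPool()

// Stands in for the blocking round-trip to the application master.
def fetchLossReason(executorId: String): String = s"loss reason for $executorId"

def onDisconnected(executorId: String): Unit = {
  // add() returns false when the id is already pending, so repeated
  // onDisconnected events for one executor trigger only one lookup.
  if (pendingDisconnectedExecutors.add(executorId)) {
    pool.submit(new Runnable {
      override def run(): Unit =
        try println(fetchLossReason(executorId))
        finally pendingDisconnectedExecutors.remove(executorId)
    })
  }
}

onDisconnected("executor-1")
onDisconnected("executor-1") // likely dropped while the first lookup is pending
pool.shutdown()
pool.awaitTermination(5, TimeUnit.SECONDS)
```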
[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8257#issuecomment-132368265 [Test build #1653 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1653/consoleFull) for PR 8257 at commit [`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206).
[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8053#issuecomment-132368081 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9893] user guide for VectorSlicer
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/8267#issuecomment-132333738 @yinxusen Sorry, I think there's some merge conflicts. Do you mind rebasing master?
[GitHub] spark pull request: [SPARK-9439] [yarn] External shuffle service r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7943#issuecomment-132333839 Merged build triggered.
[GitHub] spark pull request: [SPARK-9893] user guide for VectorSlicer
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8267#discussion_r37346725

```
--- Diff: docs/ml-features.md ---
@@ -1389,3 +1389,145 @@ print(output.select("features", "clicked").first())

 # Feature Selectors

+## VectorSlicer
+
+`VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
+
+`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column whose values are selected via those indices. There are two types of indices,
+
+ 1. Integer indices that represents the real indices in the vector, `setIndices()`;
```

--- End diff --

I would remove the word "real" (i.e. "...that represent the indices into the vector") since it could be confused for real numbers (i.e. real-valued indices, which don't really make sense)
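[Editor's note] A conceptual sketch of what the slicer described in this doc change does: select a sub-array of features by integer indices. Plain arrays stand in for the ML vector types, and the real `VectorSlicer` also accepts feature names resolved through column metadata via `setNames()`:

```scala
val features = Array(0.0, 10.0, 0.5, 3.2)
val indices = Array(1, 3) // analogous to setIndices(Array(1, 3))

// Select the features at the given positions, preserving index order.
val sliced = indices.map(features(_))
println(sliced.mkString("[", ", ", "]"))
```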
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132336754 LGTM
[GitHub] spark pull request: [SPARK-10001] [CORE] Allow Ctrl-C in spark-she...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8216#issuecomment-132339598 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41148/ Test PASSed.
[GitHub] spark pull request: [SPARK-10001] [CORE] Allow Ctrl-C in spark-she...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8216#issuecomment-132339594 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37347701

```
--- Diff: docs/ml-linear-methods.md ---
@@ -118,12 +133,114 @@ lrModel = lr.fit(training)
 print("Weights: " + str(lrModel.weights))
 print("Intercept: " + str(lrModel.intercept))
 {% endhighlight %}
+</div>
 </div>

+The `spark.ml` implementation of logistic regression also supports
+extracting a summary of the model over the training set. Note that the
+predictions and metrics which are stored as `DataFrame`s in
+`BinaryLogisticRegressionSummary` are annotated `@transient` and hence
+only available on the driver.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
+provides a summary for a
+[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
+Currently, only binary classification is supported and the
+summary must be explicitly cast to
+[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
+This will likely change when multiclass classification is supported.
+
+Continuing the earlier example:
+
+{% highlight scala %}
+// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example
+val trainingSummary = lrModel.summary
+
+// Obtain the loss per iteration.
+val objectiveHistory = trainingSummary.objectiveHistory
+objectiveHistory.foreach(loss => println(loss))
+
+// Obtain the metrics useful to judge performance on test data.
+// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a
+// binary classification problem.
+val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]
+
+// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
+val roc = binarySummary.roc
+roc.show()
+roc.select("FPR").show()
+println(binarySummary.areaUnderROC)
+
+// Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression with
+// this selected threshold.
+val fMeasure = binarySummary.fMeasureByThreshold
+val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
+val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure).
+  select("threshold").head().getDouble(0)
+logReg.setThreshold(bestThreshold)
+logReg.fit(logRegDataFrame)
+{% endhighlight %}
 </div>

-### Optimization
+<div data-lang="java" markdown="1">
+[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
+provides a summary for a
+[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
+Currently, only binary classification is supported and the
+summary must be explicitly cast to
+[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
+This will likely change when multiclass classification is supported.
+
+Continuing the earlier example:
+
+{% highlight java %}
+// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example
+LogisticRegressionTrainingSummary trainingSummary = logRegModel.summary();
+
+// Obtain the loss per iteration.
+double[] objectiveHistory = trainingSummary.objectiveHistory();
+for (double lossPerIteration : objectiveHistory) {
```

--- End diff --

Perhaps... the [Spark style guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) only covers Scala, so I was just going by past experience on this change. We could try to get a Java style guide in if there's a community need for it.
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132339896 Merged build triggered.
[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8285#issuecomment-132340973 Merged build started.
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132344114 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8197#issuecomment-132344168 [Test build #41162 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41162/consoleFull) for PR 8197 at commit [`7bf922c`](https://github.com/apache/spark/commit/7bf922c53b0e7f6e6d5304107f432b58ad7b93c7).
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132347116 Merged to master and 1.5.
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8281
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/8286 [SPARK-10095] [SQL] use public API of BigInteger

In UnsafeRow, we use the private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and it makes the code non-portable (it may fail on other JVM implementations). So we should use the public API instead. cc @rxin

You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark portable_decimal Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8286.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8286

commit e547fe80f59f83fe2b3934215975f9180c5da164 Author: Davies Liu dav...@databricks.com Date: 2015-08-18T20:59:58Z use public API of BigInteger
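For context, the portable route the PR description points at amounts to round-tripping a `BigInteger` through its public byte representation instead of reaching into its internal magnitude array. A sketch of that general pattern (the class and method names here are illustrative, not the PR's actual code):

```java
import java.math.BigInteger;

public class PortableDecimal {
    // Serialize a BigInteger using only public API: its sign plus the
    // big-endian bytes of its magnitude.
    static byte[] magnitudeBytes(BigInteger v) {
        return v.abs().toByteArray();   // public and portable across JVMs
    }

    // Reconstruct via the public (int signum, byte[] magnitude) constructor.
    static BigInteger restore(int signum, byte[] magnitude) {
        return new BigInteger(signum, magnitude);
    }

    public static void main(String[] args) {
        BigInteger original = new BigInteger("-123456789012345678901234567890");
        BigInteger roundTripped =
            restore(original.signum(), magnitudeBytes(original));
        System.out.println(roundTripped.equals(original));  // prints true
    }
}
```

Unlike reading `BigInteger`'s private `int[] mag` field via reflection or `Unsafe`, every call above is part of the documented `java.math` API, so it behaves identically on any compliant JVM.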
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37354476

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---

@@ -91,6 +92,51 @@ private[spark] abstract class YarnSchedulerBackend(
   }

   /**
+   * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected.
+   * We should check the cluster manager and find if the loss of the executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)])

--- End diff --

Wait, I'm confused. `YarnSchedulerEndpoint` and `DriverEndpoint` are different things; `YarnSchedulerEndpoint` is a communication channel between the driver in YARN mode and the YARN AM, and there are no executors involved. Why wouldn't that work here?
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37355322

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---

@@ -91,6 +92,68 @@ private[spark] abstract class YarnSchedulerBackend(
   }

   /**
+   * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected.
+   * We should check the cluster manager and find if the loss of the executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)])
+    extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+    private val pendingDisconnectedExecutors = new HashSet[String]
+    private val handleDisconnectedExecutorThreadPool =
+      ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+    /**
+     * When onDisconnected is received at the driver endpoint, the superclass DriverEndpoint
+     * handles it by assuming the Executor was lost for a bad reason and removes the executor
+     * immediately.
+     *
+     * In YARN's case however it is crucial to talk to the application master and ask why the
+     * executor had exited. In particular, the executor may have exited due to the executor
+     * having been preempted. If the executor exited normally according to the application
+     * master then we pass that information down to the TaskSetManager to inform the
+     * TaskSetManager that tasks on that lost executor should not count towards a job failure.
+     */
+    override def onDisconnected(rpcAddress: RpcAddress): Unit = {
+      addressToExecutorId.get(rpcAddress).foreach { executorId =>
+        // onDisconnected could be fired multiple times from the same executor while we're
+        // asynchronously contacting the AM. So keep track of the executors we're trying to
+        // find loss reasons for and don't duplicate the work
+        if (!pendingDisconnectedExecutors.contains(executorId)) {
+          pendingDisconnectedExecutors.add(executorId)
+          handleDisconnectedExecutorThreadPool.submit(new Runnable() {
+            override def run(): Unit = {
+              val executorLossReason =
+                // Check for the loss reason and pass the loss reason to driverEndpoint
+                yarnSchedulerEndpoint.askWithRetry[Option[ExecutorLossReason]](
+                  GetExecutorLossReason(executorId))
+              executorLossReason match {
+                case Some(reason) =>
+                  driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, reason))
+                case None =>
+                  logWarning(s"Attempted to get executor loss reason " +
+                    s"for $rpcAddress but got no response. Marking as slave lost.")
+                  driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, SlaveLost()))

--- End diff --

Definitely don't think that's thread safe. It touches things like addressToExecutorId, which, as we can see in the onDisconnected method itself, is accessed in the event loop.
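The deduplication idea in this diff — remember which executors already have an in-flight loss-reason lookup and skip repeat disconnect events — can be shown in isolation with plain JDK concurrency primitives. This is a standalone sketch under that reading of the diff; the names are illustrative, not the PR's classes:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class DisconnectDedup {
    // Executors with a loss-reason lookup currently in flight.
    final Set<String> pending = ConcurrentHashMap.newKeySet();
    final ExecutorService pool = Executors.newCachedThreadPool();

    // Fire the expensive lookup at most once per executor while one is in
    // flight; Set.add is atomic, so the check-then-act cannot race.
    void onDisconnected(String executorId, Consumer<String> lookup) {
        if (pending.add(executorId)) {
            pool.submit(() -> {
                try {
                    lookup.accept(executorId);   // the slow "ask the AM" call
                } finally {
                    pending.remove(executorId);
                }
            });
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DisconnectDedup dedup = new DisconnectDedup();
        AtomicInteger calls = new AtomicInteger();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // First event starts a lookup and blocks inside it.
        dedup.onDisconnected("exec-1", id -> {
            calls.incrementAndGet();
            started.countDown();
            try { release.await(); } catch (InterruptedException ignored) { }
        });
        started.await();

        // Second event for the same executor is deduplicated while in flight.
        dedup.onDisconnected("exec-1", id -> calls.incrementAndGet());

        release.countDown();
        dedup.pool.shutdown();
        dedup.pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(calls.get());  // prints 1
    }
}
```

Note this sketch only avoids duplicate lookups; it does not address the separate thread-safety concern raised above about touching driver-side state such as addressToExecutorId from a pool thread.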
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37356390

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---

@@ -207,6 +211,17 @@ private[yarn] class YarnAllocator(
   }

   /**
+   * Gets the executor loss reason for a disconnected executor.
+   * Note that this method is expected to be called exactly once per executor ID.
+   */
+  def getExecutorLossReason(executorId: String): ExecutorLossReason = synchronized {
+    allocateResources()
+    // Expect to be asked for a loss reason once and exactly once.
+    assert(completedExecutorExitReasons.contains(executorId))

--- End diff --

That's up to the YARN daemons - I'm going off of the assumption that the AMRM client will always report the most up-to-date status about containers. If this isn't necessarily true then we should revisit this.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user markgrover commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132367263

> Actually @markgrover can you describe in more detail how you were trying to use GetExecutorLossReason?

So, I uploaded my [YARN](https://gist.github.com/markgrover/bb816fe8556e9498871a) and [driver](https://gist.github.com/markgrover/3c4ef4e3a7823864bbd8) logs. See [here](https://gist.github.com/markgrover/bb816fe8556e9498871a#file-yarn-log-L92) for how the executor loss reason is being asked for twice. I verified that it was only requested once by looking at the driver log. You can do so too by searching for *Requesting loss reason for executorId: 2* in the [full driver log](https://gist.githubusercontent.com/markgrover/3c4ef4e3a7823864bbd8/raw/ec8793874b8ff0545fd61f6bdc6e7f0681f9de1c/driver.log). The relevant event-receiving code snippet is [here](https://github.com/markgrover/spark/blob/5a90c9926a396cf6b0b68ee3fabbfc67ae07dcf7/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L213) and the relevant event-sending code snippet is [here](https://github.com/markgrover/spark/blob/5a90c9926a396cf6b0b68ee3fabbfc67ae07dcf7/core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L119). This is all before merging your latest change, since I was trying this out last Friday. I wasn't able to figure out then why two events were being received.

Anyways, I realize this is a lot of info and perhaps hard to go through. I will poke at it more, perhaps merge your latest commit and see if that helps. I am envisioning some pain given that your and my pull requests share some code. If it continues for much longer, maybe we should work off the same branch. I personally am indifferent to whether I rebase on yours, or vice versa. I was also hoping to find a Spark IRC channel where we could collaborate in real time but couldn't. I would definitely be open to hacking on this in a more collaborative way if you think it'd help (I think it would).
[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8257#issuecomment-132369841 [Test build #1652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1652/console) for PR 8257 at commit [`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10004] [shuffle] Perform auth checks wh...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8218#issuecomment-132369820 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41156/ Test FAILed.
[GitHub] spark pull request: [SPARK-8918] [MLLIB] [DOC] Add @since tags to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8288#issuecomment-132372147 [Test build #41166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41166/consoleFull) for PR 8288 at commit [`72fdeb6`](https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659).
[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...
Github user sabhyankar commented on the pull request: https://github.com/apache/spark/pull/8241#issuecomment-132374365 @holdenk Not sure if you are reviewing the other PRs, but the fix should now be in all of them. Thx!
[GitHub] spark pull request: [SPARK-10098][STREAMING][TEST] Cleanup active ...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/8289 [SPARK-10098][STREAMING][TEST] Cleanup active context after test in FailureSuite

Failures in streaming.FailureSuite can leak StreamingContext and SparkContext, which fails all subsequent tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/tdas/spark SPARK-10098 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8289.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8289

commit 3545d1efb7c46a23dbb13dae0d65a5c8d3e9aca6 Author: Tathagata Das tathagata.das1...@gmail.com Date: 2015-08-18T22:22:07Z Cleanup active contexts after test
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/8184#discussion_r37361712

--- Diff: docs/ml-features.md ---

@@ -649,6 +649,77 @@ for expanded in polyDF.select("polyFeatures").take(3):
 </div>
 </div>

+## Discrete Cosine Transform (DCT)
+
+The [Discrete Cosine
+Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform)
+transforms a length $N$ real-valued sequence in the time domain into
+another length $N$ real-valued sequence in the frequency domain. A
+[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class
+provides this functionality, implementing the
+[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II)
+and scaling the result by $1/\sqrt{2}$ such that the representing matrix
+for the transform is unitary. No shift is applied to the transformed
+sequence (e.g. the $0$th element of the transformed sequence is the
+$0$th DCT coefficient and _not_ the $N/2$th).
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.DCT
+import org.apache.spark.mllib.linalg.Vectors
+
+val data = Seq(
+  Vectors.dense(0.0, 1.0, -2.0, 3.0),
+  Vectors.dense(-1.0, 2.0, 4.0, -7.0),
+  Vectors.dense(14.0, -2.0, -5.0, 1.0))
+val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
+val dct = new DCT()
+  .setInputCol("features")
+  .setOutputCol("featuresDCT")
+  .setInverse(false)
+val dctDf = dct.transform(df)
+dctDf.select("featuresDCT").show(3)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.ml.feature.DCT;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.VectorUDT;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+JavaRDD<Row> data = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)),
+  RowFactory.create(Vectors.dense(-1.0, 2.0, 4.0, -7.0)),
+  RowFactory.create(Vectors.dense(14.0, -2.0, -5.0, 1.0))
+));
+StructType schema = new StructType(new StructField[] {
+  new StructField("features", new VectorUDT(), false, Metadata.empty()),
+});
+DataFrame df = jsql.createDataFrame(data, schema);
+DCT dct = new DCT()
+  .setInputCol("features")
+  .setOutputCol("featuresDCT")
+  .setInverse(false);
+DataFrame dctDf = dct.transform(df);
+dctDf.select("featuresDCT").take(3).show(3);

--- End diff --

Remove take(3)
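The unitarity property the quoted doc text describes (DCT-II with the extra $1/\sqrt{2}$ scaling on the $0$th coefficient) is easy to check numerically. Below is a plain-Java sketch of the orthonormal DCT-II, written from the standard DCT-II formula and independent of the `ml.feature.DCT` implementation; a unitary transform must preserve the Euclidean norm of its input:

```java
public class Dct2 {
    // Orthonormal DCT-II: X_k = s_k * sum_i x_i * cos(pi/N * (i + 1/2) * k),
    // with s_0 = sqrt(1/N) and s_k = sqrt(2/N) for k > 0. These scale factors
    // make the transform matrix unitary.
    static double[] dct2(double[] x) {
        int n = x.length;
        double[] out = new double[n];
        for (int k = 0; k < n; k++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += x[i] * Math.cos(Math.PI / n * (i + 0.5) * k);
            }
            double scale = (k == 0) ? Math.sqrt(1.0 / n) : Math.sqrt(2.0 / n);
            out[k] = scale * sum;
        }
        return out;
    }

    static double norm(double[] v) {
        double s = 0.0;
        for (double e : v) s += e * e;
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[] x = {0.0, 1.0, -2.0, 3.0};   // same vector as the docs example
        double[] y = dct2(x);
        // Unitarity check: input and output norms agree up to rounding.
        System.out.println(Math.abs(norm(x) - norm(y)) < 1e-9);  // prints true
    }
}
```

This also illustrates the "no shift" remark in the doc text: `dct2(x)[0]` is the (scaled) sum of the inputs, i.e. the $0$th DCT coefficient sits at index $0$.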
[GitHub] spark pull request: [SPARK-10012][ML] Missing test case for Params...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8223#issuecomment-132375517 No, just trying not to merge stuff which isn't critical, but sure, I'll merge it with master and branch-1.5.