date:20150818

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37359578
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -207,6 +211,17 @@ private[yarn] class YarnAllocator(
   }
 
   /**
+   * Gets the executor loss reason for a disconnected executor.
+   * Note that this method is expected to be called exactly once per 
executor ID.
+   */
+  def getExecutorLossReason(executorId: String): ExecutorLossReason = 
synchronized {
+allocateResources()
+// Expect to be asked for a loss reason once and exactly once.
+assert(completedExecutorExitReasons.contains(executorId))
--- End diff --

In such a case that AMRM client can't be guaranteed to find the latest 
container exit statuses, then we would need to do polling and perhaps "give up" 
after a few tries? I don't see how we could ever guarantee that 
getExecutorLossReason could ever be correct if we can't ever be guaranteed to 
find the correct state from the NM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @sin...

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7380#issuecomment-132368741
  
Okay, I merged this into master and branch-1.5 so we can unblock 
SPARK-9864. @MechCoder Please fix the issue you found in your PR. Thanks 
@BryanCutler for adding versions and @MechCoder for review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37359214
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -207,6 +211,17 @@ private[yarn] class YarnAllocator(
   }
 
   /**
+   * Gets the executor loss reason for a disconnected executor.
+   * Note that this method is expected to be called exactly once per 
executor ID.
+   */
+  def getExecutorLossReason(executorId: String): ExecutorLossReason = 
synchronized {
+allocateResources()
+// Expect to be asked for a loss reason once and exactly once.
+assert(completedExecutorExitReasons.contains(executorId))
--- End diff --

Your assumption probably holds for the preemption case, since it's YARN 
killing the container. But I can imagine that if the container exits by itself, 
it might be possible for the disconnect to reach the driver endpoint and the 
`GetExecutorLossReason` message to reach the AM before the NM has had a chance 
to process the container exit and communicate that to the RM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @sin...

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7380


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8257#issuecomment-132368265
  
  [Test build #1653 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1653/consoleFull)
 for   PR 8257 at commit 
[`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8053#issuecomment-132368081
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8053#issuecomment-132368084
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41157/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8053#issuecomment-132367967
  
  [Test build #41157 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41157/console)
 for   PR 8053 at commit 
[`5be3e44`](https://github.com/apache/spark/commit/5be3e449aa0306c41398408157a7db1cd94f1aa8).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

2015-08-18 Thread markgrover

Github user markgrover commented on the pull request:

https://github.com/apache/spark/pull/8007#issuecomment-132367263

>Actually @markgrover can you describe in more detail how you were trying
to use GetExecutorLossReason?

So, I uploaded my
[YARN](https://gist.github.com/markgrover/bb816fe8556e9498871a) and
[driver](https://gist.github.com/markgrover/3c4ef4e3a7823864bbd8) logs.

See
[here](https://gist.github.com/markgrover/bb816fe8556e9498871a#file-yarn-log-L92)
as to how the Executor Loss Reason is being asked for twice. I asserted that
it was only requested once by looking at the driver log. You can do so too by
searching for *Requesting loss reason for executorId: 2* in the [full driver
log](https://gist.githubusercontent.com/markgrover/3c4ef4e3a7823864bbd8/raw/ec8793874b8ff0545fd61f6bdc6e7f0681f9de1c/driver.log).

The relevant event receiving code snippet is
[here](https://github.com/markgrover/spark/blob/5a90c9926a396cf6b0b68ee3fabbfc67ae07dcf7/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L213)
and the relevant event sending code snippet is
[here](https://github.com/markgrover/spark/blob/5a90c9926a396cf6b0b68ee3fabbfc67ae07dcf7/core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L119).

This is all before merging your latest change since I was trying this out
last Friday. I wasn't able to figure out then why 2 events were being received.

Anyways, I realize this is a lot of info and perhaps, hard to go through. I
will poke at it more, perhaps, merge your latest commit and see if that helps.
I am envisioning some pain given that your and my pull requests are sharing
some code. If it continues for much longer, may be we should work off of the
same branch. I personally am indifferent to whether I rebase on yours, or
vice-versa.

I always also hoping to find a spark IRC channel where we could collaborate
real time but couldn't. I would definitely be open to hacking on this in a more
collaborative way if you think it'd help (I think it would).

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37358327
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,66 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
+  extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+private val pendingDisconnectedExecutors = new HashSet[String]
+private val handleDisconnectedExecutorThreadPool =
+  
ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+/**
+ * When onDisconnected is received at the driver endpoint, the 
superclass DriverEndpoint
+ * handles it by assuming the Executor was lost for a bad reason and 
removes the executor
+ * immediately.
+ *
+ * In YARN's case however it is crucial to talk to the application 
master and ask why the
+ * executor had exited. In particular, the executor may have exited 
due to the executor
+ * having been preempted. If the executor "exited normally" according 
to the application
+ * master then we pass that information down to the TaskSetManager to 
inform the
+ * TaskSetManager that tasks on that lost executor should not count 
towards a job failure.
+ */
+override def onDisconnected(rpcAddress: RpcAddress): Unit = {
--- End diff --

To @markgrover's question, yes, by overriding the method then only this 
implementation will be invoked.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37358117
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,66 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
+  extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+private val pendingDisconnectedExecutors = new HashSet[String]
+private val handleDisconnectedExecutorThreadPool =
+  
ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+/**
+ * When onDisconnected is received at the driver endpoint, the 
superclass DriverEndpoint
+ * handles it by assuming the Executor was lost for a bad reason and 
removes the executor
+ * immediately.
+ *
+ * In YARN's case however it is crucial to talk to the application 
master and ask why the
+ * executor had exited. In particular, the executor may have exited 
due to the executor
+ * having been preempted. If the executor "exited normally" according 
to the application
+ * master then we pass that information down to the TaskSetManager to 
inform the
+ * TaskSetManager that tasks on that lost executor should not count 
towards a job failure.
+ */
+override def onDisconnected(rpcAddress: RpcAddress): Unit = {
--- End diff --

I guess the scary part is that if the executor died from an actual failure 
and tasks run on that bad executor, then we get more tasks that are marked as 
failed than we would have otherwise.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10060] [ML] [DOC] spark.ml DecisionTree...

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8244#discussion_r37358090
  
--- Diff: docs/ml-decision-tree.md ---
@@ -0,0 +1,506 @@
+---
+layout: global
+title: Decision Trees - SparkML
+displayTitle: ML - Decision Trees
+---
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+
+# Overview
+
+[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning)
+and their ensembles are popular methods for the machine learning tasks of
+classification and regression. Decision trees are widely used since they 
are easy to interpret,
+handle categorical features, extend to the multiclass classification 
setting, do not require
+feature scaling, and are able to capture non-linearities and feature 
interactions. Tree ensemble
+algorithms such as random forests and boosting are among the top 
performers for classification and
+regression tasks.
+
+MLlib supports decision trees for binary and multiclass classification and 
for regression,
+using both continuous and categorical features. The implementation 
partitions data by rows,
+allowing distributed training with millions or even billions of instances.
+
+Users can find more information about the decision tree algorithm in the 
[MLlib Decision Tree guide](mllib-decision-tree.html).  In this section, we 
demonstrate the Pipelines API for Decision Trees.
+
+The Pipelines API for Decision Trees offers a bit more functionality than 
the original API.  In particular, for classification, users can get the 
predicted probability of each class (a.k.a. class conditional probabilities).
+
+Ensembles of trees (Random Forests and Gradient-Boosted Trees) are 
described in the [Ensembles guide](ml-ensembles.html).
+
+# Inputs and Outputs (Predictions)
+
+We list the input and output (prediction) column types here.
+All output columns are optional; to exclude an output column, set its 
corresponding Param to an empty string.
+
+## Input Columns
+
+
+  
+
+  Param name
+  Type(s)
+  Default
+  Description
+
+  
+  
+
+  labelCol
+  Double
+  "label"
+  Label to predict
+
+
+  featuresCol
+  Vector
+  "features"
+  Feature vector
+
+  
+
+
+## Output Columns
+
+
+  
+
+  Param name
+  Type(s)
+  Default
+  Description
+  Notes
+
+  
+  
+
+  predictionCol
+  Double
+  "prediction"
+  Predicted label
+  
+
+
+  rawPredictionCol
+  Vector
+  "rawPrediction"
+  Vector of length # classes, with the counts of training instance 
labels at the tree node which makes the prediction
+  Classification only
+
+
+  probabilityCol
+  Vector
+  "probability"
+  Vector of length # classes equal to rawPrediction normalized to 
a multinomial distribution
+  Classification only
+
+  
+
+
+# Examples
+
+The below examples demonstrate the Pipelines API for Decision Trees. The 
main differences between this API and the [original MLlib Decision Tree 
API](mllib-decision-tree.html) are:
+
+* support for ML Pipelines
+* separation of Decision Trees for classification vs. regression
+* use of DataFrame metadata to distinguish continuous and categorical 
features
+
+
+## Classification
+
+The following examples load a dataset in LibSVM format, split it into 
training and test sets, train on the first dataset, and then evaluate on the 
held-out test set.
+We use two feature transformers to prepare the data; these help index 
categories for the label and categorical features, adding metadata to the 
`DataFrame` which the Decision Tree algorithm can recognize.
+
+
+
+
+More details on parameters can be found in the [Scala API 
documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
+
+{% highlight scala %}
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.ml.classification.DecisionTreeClassifier
+import org.apache.spark.ml.classification.DecisionTreeClassificationModel
+import org.apache.spark.ml.feature.{StringIndexer, IndexToString, 
VectorIndexer}
+import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
+import org.apache.spark.mllib.util.MLUtils
+
+// Load and parse the data file, converting it to a DataFrame.
+val data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt").toDF()
+
+// Index labels, adding m

[GitHub] spark pull request: [SPARK-10060] [ML] [DOC] spark.ml DecisionTree...

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8244#discussion_r37357835
  
--- Diff: docs/ml-decision-tree.md ---
@@ -0,0 +1,506 @@
+---
+layout: global
+title: Decision Trees - SparkML
+displayTitle: ML - Decision Trees
+---
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+
+# Overview
+
+[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning)
+and their ensembles are popular methods for the machine learning tasks of
+classification and regression. Decision trees are widely used since they 
are easy to interpret,
+handle categorical features, extend to the multiclass classification 
setting, do not require
+feature scaling, and are able to capture non-linearities and feature 
interactions. Tree ensemble
+algorithms such as random forests and boosting are among the top 
performers for classification and
+regression tasks.
+
+MLlib supports decision trees for binary and multiclass classification and 
for regression,
+using both continuous and categorical features. The implementation 
partitions data by rows,
+allowing distributed training with millions or even billions of instances.
+
+Users can find more information about the decision tree algorithm in the 
[MLlib Decision Tree guide](mllib-decision-tree.html).  In this section, we 
demonstrate the Pipelines API for Decision Trees.
+
+The Pipelines API for Decision Trees offers a bit more functionality than 
the original API.  In particular, for classification, users can get the 
predicted probability of each class (a.k.a. class conditional probabilities).
+
+Ensembles of trees (Random Forests and Gradient-Boosted Trees) are 
described in the [Ensembles guide](ml-ensembles.html).
+
+# Inputs and Outputs (Predictions)
+
+We list the input and output (prediction) column types here.
+All output columns are optional; to exclude an output column, set its 
corresponding Param to an empty string.
+
+## Input Columns
+
+
+  
+
+  Param name
+  Type(s)
+  Default
+  Description
+
+  
+  
+
+  labelCol
+  Double
+  "label"
+  Label to predict
+
+
+  featuresCol
+  Vector
+  "features"
+  Feature vector
+
+  
+
+
+## Output Columns
+
+
+  
+
+  Param name
+  Type(s)
+  Default
+  Description
+  Notes
+
+  
+  
+
+  predictionCol
+  Double
+  "prediction"
+  Predicted label
+  
+
+
+  rawPredictionCol
+  Vector
+  "rawPrediction"
+  Vector of length # classes, with the counts of training instance 
labels at the tree node which makes the prediction
+  Classification only
+
+
+  probabilityCol
+  Vector
+  "probability"
+  Vector of length # classes equal to rawPrediction normalized to 
a multinomial distribution
+  Classification only
+
+  
+
+
+# Examples
+
+The below examples demonstrate the Pipelines API for Decision Trees. The 
main differences between this API and the [original MLlib Decision Tree 
API](mllib-decision-tree.html) are:
+
+* support for ML Pipelines
+* separation of Decision Trees for classification vs. regression
+* use of DataFrame metadata to distinguish continuous and categorical 
features
+
+
+## Classification
+
+The following examples load a dataset in LibSVM format, split it into 
training and test sets, train on the first dataset, and then evaluate on the 
held-out test set.
+We use two feature transformers to prepare the data; these help index 
categories for the label and categorical features, adding metadata to the 
`DataFrame` which the Decision Tree algorithm can recognize.
+
+
+
+
+More details on parameters can be found in the [Scala API 
documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
+
+{% highlight scala %}
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.ml.classification.DecisionTreeClassifier
+import org.apache.spark.ml.classification.DecisionTreeClassificationModel
+import org.apache.spark.ml.feature.{StringIndexer, IndexToString, 
VectorIndexer}
+import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
+import org.apache.spark.mllib.util.MLUtils
+
+// Load and parse the data file, converting it to a DataFrame.
+val data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt").toDF()
+
+// Index labels, adding m

[GitHub] spark pull request: [SPARK-10060] [ML] [DOC] spark.ml DecisionTree...

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8244#discussion_r37357841
  
--- Diff: docs/ml-decision-tree.md ---
@@ -0,0 +1,506 @@
+---
+layout: global
+title: Decision Trees - SparkML
+displayTitle: ML - Decision Trees
+---
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+
+# Overview
+
+[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning)
+and their ensembles are popular methods for the machine learning tasks of
+classification and regression. Decision trees are widely used since they 
are easy to interpret,
+handle categorical features, extend to the multiclass classification 
setting, do not require
+feature scaling, and are able to capture non-linearities and feature 
interactions. Tree ensemble
+algorithms such as random forests and boosting are among the top 
performers for classification and
+regression tasks.
+
+MLlib supports decision trees for binary and multiclass classification and 
for regression,
+using both continuous and categorical features. The implementation 
partitions data by rows,
+allowing distributed training with millions or even billions of instances.
+
+Users can find more information about the decision tree algorithm in the 
[MLlib Decision Tree guide](mllib-decision-tree.html).  In this section, we 
demonstrate the Pipelines API for Decision Trees.
+
+The Pipelines API for Decision Trees offers a bit more functionality than 
the original API.  In particular, for classification, users can get the 
predicted probability of each class (a.k.a. class conditional probabilities).
+
+Ensembles of trees (Random Forests and Gradient-Boosted Trees) are 
described in the [Ensembles guide](ml-ensembles.html).
+
+# Inputs and Outputs (Predictions)
+
+We list the input and output (prediction) column types here.
+All output columns are optional; to exclude an output column, set its 
corresponding Param to an empty string.
+
+## Input Columns
+
+
+  
+
+  Param name
+  Type(s)
+  Default
+  Description
+
+  
+  
+
+  labelCol
+  Double
+  "label"
+  Label to predict
+
+
+  featuresCol
+  Vector
+  "features"
+  Feature vector
+
+  
+
+
+## Output Columns
+
+
+  
+
+  Param name
+  Type(s)
+  Default
+  Description
+  Notes
+
+  
+  
+
+  predictionCol
+  Double
+  "prediction"
+  Predicted label
+  
+
+
+  rawPredictionCol
+  Vector
+  "rawPrediction"
+  Vector of length # classes, with the counts of training instance 
labels at the tree node which makes the prediction
+  Classification only
+
+
+  probabilityCol
+  Vector
+  "probability"
+  Vector of length # classes equal to rawPrediction normalized to 
a multinomial distribution
+  Classification only
+
+  
+
+
+# Examples
+
+The below examples demonstrate the Pipelines API for Decision Trees. The 
main differences between this API and the [original MLlib Decision Tree 
API](mllib-decision-tree.html) are:
+
+* support for ML Pipelines
+* separation of Decision Trees for classification vs. regression
+* use of DataFrame metadata to distinguish continuous and categorical 
features
+
+
+## Classification
+
+The following examples load a dataset in LibSVM format, split it into 
training and test sets, train on the first dataset, and then evaluate on the 
held-out test set.
+We use two feature transformers to prepare the data; these help index 
categories for the label and categorical features, adding metadata to the 
`DataFrame` which the Decision Tree algorithm can recognize.
+
+
+
+
+More details on parameters can be found in the [Scala API 
documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
+
+{% highlight scala %}
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.ml.classification.DecisionTreeClassifier
+import org.apache.spark.ml.classification.DecisionTreeClassificationModel
+import org.apache.spark.ml.feature.{StringIndexer, IndexToString, 
VectorIndexer}
+import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
+import org.apache.spark.mllib.util.MLUtils
+
+// Load and parse the data file, converting it to a DataFrame.
+val data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt").toDF()
+
+// Index labels, adding m

[GitHub] spark pull request: [SPARK-10088] [sql] Add support for "stored as...

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8282


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10054][Streaming]Add a timeout for laun...

2015-08-18 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/8242#discussion_r37357797
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
 ---
@@ -439,6 +465,30 @@ class ReceiverTracker(ssc: StreamingContext, 
skipReceiverLaunch: Boolean = false
 for (info <- receiverTrackingInfos.get(streamUID); eP <- 
info.endpoint) {
   eP.send(UpdateRateLimit(newRate))
 }
+  case ReceiverLaunchingTimeout(receiverId) => {
+val timeoutFuture = receiverTimeoutFutures.remove(receiverId)
+timeoutFuture.foreach { f =>
+  // If the receiver has not yet registered, start a new thread to 
stop StreamingContext
+  // gracefully.
+  new Thread("stopping-StreamingContext") {
+setDaemon(true)
+
+override def run(): Unit = {
+  if (isTrackerStarted) {
+val stopSparkContext =
--- End diff --

Actually, you have to stop the context with an exception because something 
went wrong and the receiver could not be scheduled. Use reportError()


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10012][ML] Missing test case for Params...

2015-08-18 Thread srowen

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/8223#issuecomment-132365457
  
@jkbradley do we need to wait for something here? things can go into 
`master` safely; this looks safe for `branch-1.5` too regardless of whether it 
makes this RC or another release. Did I miss some email about an RC and not 
merging?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...

2015-08-18 Thread holdenk

Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/8241#issuecomment-132365264
  
Great :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37357691
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,66 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
+  extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+private val pendingDisconnectedExecutors = new HashSet[String]
+private val handleDisconnectedExecutorThreadPool =
+  
ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+/**
+ * When onDisconnected is received at the driver endpoint, the 
superclass DriverEndpoint
+ * handles it by assuming the Executor was lost for a bad reason and 
removes the executor
+ * immediately.
+ *
+ * In YARN's case however it is crucial to talk to the application 
master and ask why the
+ * executor had exited. In particular, the executor may have exited 
due to the executor
+ * having been preempted. If the executor "exited normally" according 
to the application
+ * master then we pass that information down to the TaskSetManager to 
inform the
+ * TaskSetManager that tasks on that lost executor should not count 
towards a job failure.
+ */
+override def onDisconnected(rpcAddress: RpcAddress): Unit = {
--- End diff --

I guess there will be wasted work in the sense that the tasks will get 
allocated to the bad executor and then the executor will be removed and all of 
those tasks are relocated to the healthy ones. That's probably fine from a 
correctness standpoint but might create a bit of a performance latency... I'm 
open to the discussion of doing another architecture overhaul to get the 
soft-unregistration construct done.

The other thing I'm wondering is if it's even worth offloading this 
communicate-with-AM logic to be asynchronous at all. How big of a performance 
penalty would it be to block the event loop with the request to the AM for the 
get-executor-loss-reason? I presumed that it was unacceptable to do that 
blocking request on the main event loop though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10088] [sql] Add support for "stored as...

2015-08-18 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/8282#issuecomment-132364808
  
Thanks!  Merging to master and 1.5.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8283


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...

2015-08-18 Thread sabhyankar

Github user sabhyankar commented on the pull request:

https://github.com/apache/spark/pull/8241#issuecomment-132364417
  
@holdenk yep :) I am updating those as well!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...

2015-08-18 Thread holdenk

Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/8241#issuecomment-132363721
  
Awesome, it also seemed to be in many of your related PRs, you might want 
to update those as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.

2015-08-18 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/8283#issuecomment-132363515
  
Merging to master and 1.5!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...

2015-08-18 Thread sabhyankar

Github user sabhyankar commented on the pull request:

https://github.com/apache/spark/pull/8241#issuecomment-132362277
  
Thanks for pointing that out @holdenk ! I have pushed a change to the PR!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9782] [YARN] Support YARN application t...

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8072


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37356728
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -207,6 +211,17 @@ private[yarn] class YarnAllocator(
   }
 
   /**
+   * Gets the executor loss reason for a disconnected executor.
+   * Note that this method is expected to be called exactly once per 
executor ID.
+   */
+  def getExecutorLossReason(executorId: String): ExecutorLossReason = 
synchronized {
+allocateResources()
+// Expect to be asked for a loss reason once and exactly once.
+assert(completedExecutorExitReasons.contains(executorId))
--- End diff --

By the way I'm open to removing this assertion and building up the map 
without removing items from it, but I'm not sure of what the memory 
implications are.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37356390
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -207,6 +211,17 @@ private[yarn] class YarnAllocator(
   }
 
   /**
+   * Gets the executor loss reason for a disconnected executor.
+   * Note that this method is expected to be called exactly once per 
executor ID.
+   */
+  def getExecutorLossReason(executorId: String): ExecutorLossReason = 
synchronized {
+allocateResources()
+// Expect to be asked for a loss reason once and exactly once.
+assert(completedExecutorExitReasons.contains(executorId))
--- End diff --

That's up to the YARN daemons - I'm going off of the assumption that the 
AMRM client will always report the most up to date status about containers. If 
this isn't necessarily true then we should revisit this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/8007#issuecomment-132358323
  
This is definitely going in the right direction; there's a couple of things 
that I think need some further investigation, though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37356067
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -207,6 +211,17 @@ private[yarn] class YarnAllocator(
   }
 
   /**
+   * Gets the executor loss reason for a disconnected executor.
+   * Note that this method is expected to be called exactly once per 
executor ID.
+   */
+  def getExecutorLossReason(executorId: String): ExecutorLossReason = 
synchronized {
+allocateResources()
+// Expect to be asked for a loss reason once and exactly once.
+assert(completedExecutorExitReasons.contains(executorId))
--- End diff --

Isn't this racy? Is there a possibility that even after calling 
`allocateResources` we'll still not know about why a particular executor has 
exited?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37355811
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -144,6 +207,21 @@ private[spark] abstract class YarnSchedulerBackend(
 context.reply(false)
 }
 
+  case c: GetExecutorLossReason =>
+amEndpoint match {
+  case Some(am) =>
+Future {
+  context.reply(am.askWithRetry[Option[ExecutorLossReason]](c))
+} onFailure {
+  case NonFatal(e) =>
+logError(s"Finding the executor loss reason was 
unsuccessful", e)
+context.sendFailure(e)
+}
+  case None =>
--- End diff --

This is not your fault, but this is starting to get pretty noisy. We need a 
better way to do this check.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8257#issuecomment-132356303
  
  [Test build #1652 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1652/consoleFull)
 for   PR 8257 at commit 
[`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37355580
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,66 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
+  extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+private val pendingDisconnectedExecutors = new HashSet[String]
+private val handleDisconnectedExecutorThreadPool =
+  
ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+/**
+ * When onDisconnected is received at the driver endpoint, the 
superclass DriverEndpoint
+ * handles it by assuming the Executor was lost for a bad reason and 
removes the executor
+ * immediately.
+ *
+ * In YARN's case however it is crucial to talk to the application 
master and ask why the
+ * executor had exited. In particular, the executor may have exited 
due to the executor
+ * having been preempted. If the executor "exited normally" according 
to the application
+ * master then we pass that information down to the TaskSetManager to 
inform the
+ * TaskSetManager that tasks on that lost executor should not count 
towards a job failure.
+ */
+override def onDisconnected(rpcAddress: RpcAddress): Unit = {
--- End diff --

One side-effect of this is that there will be a delay between the executor 
disconnecting from the driver and it being marked as unavailable for running 
tasks. So isn't it possible that while we're waiting for the AM to reply, the 
scheduler will try to run tasks on that executor?

Is there some sort of "soft unregistration" that could be done here so that 
the executor is not used for new tasks, but we still haven't failed the 
existing tasks pending figuring out the exit reason?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10054][Streaming]Add a timeout for laun...

2015-08-18 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/8242#discussion_r37355584
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
 ---
@@ -143,6 +149,21 @@ class ReceiverTracker(ssc: StreamingContext, 
skipReceiverLaunch: Boolean = false
*/
   private val receiverPreferredLocations = new HashMap[Int, Option[String]]
 
+  /**
+   * The max timeout to launch a receiver. If a receiver cannot register 
in time, StreamingContext
+   * will be stopped.
+   */
+  private val RECEIVER_LAUNCHING_MAX_TIMEOUT =
--- End diff --

Timeout itself means a max time limit the system will wait. What is the 
meaning of a max timeout?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37355322
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,68 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
+  extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+private val pendingDisconnectedExecutors = new HashSet[String]
+private val handleDisconnectedExecutorThreadPool =
+  
ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+/**
+ * When onDisconnected is received at the driver endpoint, the 
superclass DriverEndpoint
+ * handles it by assuming the Executor was lost for a bad reason and 
removes the executor
+ * immediately.
+ *
+ * In YARN's case however it is crucial to talk to the application 
master and ask why the
+ * executor had exited. In particular, the executor may have exited 
due to the executor
+ * having been preempted. If the executor "exited normally" according 
to the application
+ * master then we pass that information down to the TaskSetManager to 
inform the
+ * TaskSetManager that tasks on that lost executor should not count 
towards a job failure.
+ */
+override def onDisconnected(rpcAddress: RpcAddress): Unit = {
+  addressToExecutorId.get(rpcAddress).foreach({ executorId =>
+// onDisconnected could be fired multiple times from the same 
executor while we're
+// asynchronously contacting the AM. So keep track of the 
executors we're trying to
+// find loss reasons for and don't duplicate the work
+if (!pendingDisconnectedExecutors.contains(executorId)) {
+  pendingDisconnectedExecutors.add(executorId)
+  handleDisconnectedExecutorThreadPool.submit(new Runnable() {
+override def run(): Unit = {
+  val executorLossReason =
+  // Check for the loss reason and pass the loss reason to 
driverEndpoint
+
yarnSchedulerEndpoint.askWithRetry[Option[ExecutorLossReason]](
+GetExecutorLossReason(executorId))
+  executorLossReason match {
+case Some(reason) =>
+  
driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, reason))
+case None =>
+  logWarning(s"Attempted to get executor loss reason" +
+s" for $rpcAddress but got no response. Marking as 
slave lost.")
+  
driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, SlaveLost()))
--- End diff --

Definitely don't think that's thread safe. It touches things like 
addressToExecutorId, which as we can see in the onDisconnected method itself is 
accessed in the event loop.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37355106
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,68 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
+  extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+private val pendingDisconnectedExecutors = new HashSet[String]
+private val handleDisconnectedExecutorThreadPool =
+  
ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+/**
+ * When onDisconnected is received at the driver endpoint, the 
superclass DriverEndpoint
+ * handles it by assuming the Executor was lost for a bad reason and 
removes the executor
+ * immediately.
+ *
+ * In YARN's case however it is crucial to talk to the application 
master and ask why the
+ * executor had exited. In particular, the executor may have exited 
due to the executor
+ * having been preempted. If the executor "exited normally" according 
to the application
+ * master then we pass that information down to the TaskSetManager to 
inform the
+ * TaskSetManager that tasks on that lost executor should not count 
towards a job failure.
+ */
+override def onDisconnected(rpcAddress: RpcAddress): Unit = {
+  addressToExecutorId.get(rpcAddress).foreach({ executorId =>
+// onDisconnected could be fired multiple times from the same 
executor while we're
+// asynchronously contacting the AM. So keep track of the 
executors we're trying to
+// find loss reasons for and don't duplicate the work
+if (!pendingDisconnectedExecutors.contains(executorId)) {
+  pendingDisconnectedExecutors.add(executorId)
+  handleDisconnectedExecutorThreadPool.submit(new Runnable() {
+override def run(): Unit = {
+  val executorLossReason =
+  // Check for the loss reason and pass the loss reason to 
driverEndpoint
+
yarnSchedulerEndpoint.askWithRetry[Option[ExecutorLossReason]](
+GetExecutorLossReason(executorId))
+  executorLossReason match {
+case Some(reason) =>
+  
driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, reason))
+case None =>
+  logWarning(s"Attempted to get executor loss reason" +
+s" for $rpcAddress but got no response. Marking as 
slave lost.")
+  
driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, SlaveLost()))
--- End diff --

Can you call `super.removeExecutor()` directly here, instead of doing the 
round-trip through the RPC layer? (Might need to check whether that method is 
thread-safe.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10088] [sql] Add support for "stored as...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8282#issuecomment-132355128
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41153/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8257#issuecomment-132355177
  
  [Test build #1651 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1651/console)
 for   PR 8257 at commit 
[`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10088] [sql] Add support for "stored as...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8282#issuecomment-132355127
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10088] [sql] Add support for "stored as...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8282#issuecomment-132354966
  
  [Test build #41153 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41153/console)
 for   PR 8282 at commit 
[`2256b43`](https://github.com/apache/spark/commit/2256b430bd7e98ca0bc92dc74bdf7340f9d134cf).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37354864
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,68 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
+  extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+private val pendingDisconnectedExecutors = new HashSet[String]
+private val handleDisconnectedExecutorThreadPool =
+  
ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+/**
+ * When onDisconnected is received at the driver endpoint, the 
superclass DriverEndpoint
+ * handles it by assuming the Executor was lost for a bad reason and 
removes the executor
+ * immediately.
+ *
+ * In YARN's case however it is crucial to talk to the application 
master and ask why the
+ * executor had exited. In particular, the executor may have exited 
due to the executor
+ * having been preempted. If the executor "exited normally" according 
to the application
+ * master then we pass that information down to the TaskSetManager to 
inform the
+ * TaskSetManager that tasks on that lost executor should not count 
towards a job failure.
+ */
+override def onDisconnected(rpcAddress: RpcAddress): Unit = {
+  addressToExecutorId.get(rpcAddress).foreach({ executorId =>
+// onDisconnected could be fired multiple times from the same 
executor while we're
+// asynchronously contacting the AM. So keep track of the 
executors we're trying to
+// find loss reasons for and don't duplicate the work
+if (!pendingDisconnectedExecutors.contains(executorId)) {
+  pendingDisconnectedExecutors.add(executorId)
+  handleDisconnectedExecutorThreadPool.submit(new Runnable() {
+override def run(): Unit = {
+  val executorLossReason =
+  // Check for the loss reason and pass the loss reason to 
driverEndpoint
--- End diff --

nit: indent this more (or move the comment before the `val` declaration).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37354672
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,51 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
--- End diff --

Ah, nevermind. You're extending DriverEndpoint and overriding 
`onDisconnected`. I should read the rest of the code before commenting on 
stuff. :-/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37354752
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,68 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
+  extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+private val pendingDisconnectedExecutors = new HashSet[String]
+private val handleDisconnectedExecutorThreadPool =
+  
ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+/**
+ * When onDisconnected is received at the driver endpoint, the 
superclass DriverEndpoint
+ * handles it by assuming the Executor was lost for a bad reason and 
removes the executor
+ * immediately.
+ *
+ * In YARN's case however it is crucial to talk to the application 
master and ask why the
+ * executor had exited. In particular, the executor may have exited 
due to the executor
+ * having been preempted. If the executor "exited normally" according 
to the application
+ * master then we pass that information down to the TaskSetManager to 
inform the
+ * TaskSetManager that tasks on that lost executor should not count 
towards a job failure.
+ */
+override def onDisconnected(rpcAddress: RpcAddress): Unit = {
+  addressToExecutorId.get(rpcAddress).foreach({ executorId =>
--- End diff --

style: `.foreach { executorId =>`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9954] [MLLIB] use first 64 nonzeros to ...

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/8182#issuecomment-132354472
  
No particular reason. Just want to have less hash collision while keeping 
the computation cheap. 16 is certainly not sufficient. See 
https://en.wikipedia.org/wiki/Java_hashCode()#The_java.lang.String_hash_function.
 In practice, a sparse instance usually has less than a few hundred nonzeros. I 
chose 64 for this reason, but I'm okay with 128, 256, 512, or 1024. However, a 
value greater than 1024 would be too much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37354476
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -91,6 +92,51 @@ private[spark] abstract class YarnSchedulerBackend(
   }
 
   /**
+   * Override the DriverEndpoint to add extra logic for the case when an 
executor is disconnected.
+   * We should check the cluster manager and find if the loss of the 
executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: 
ArrayBuffer[(String, String)])
--- End diff --

Wait, I'm confused. `YarnSchedulerEndpoint` and `DriverEndpoint` are 
different things; `YarnSchedulerEndpoint` is a communication channel between 
the driver in YARN mode and the YARN AM, there are no executors involved. Why 
wouldn't that work here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8271#issuecomment-132353694
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41159/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8271#issuecomment-132353692
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8271#issuecomment-132353469
  
  [Test build #41159 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41159/console)
 for   PR 8271 at commit 
[`a92c382`](https://github.com/apache/spark/commit/a92c38287c273d82d3e22cf35ebc8216f33d0b2d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37353932
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -809,9 +825,14 @@ private[spark] class TaskSetManager(
 }
   }
 }
-// Also re-enqueue any tasks that were running on the node
 for ((tid, info) <- taskInfos if info.running && info.executorId == 
execId) {
-  handleFailedTask(tid, TaskState.FAILED, ExecutorLostFailure(execId))
+  // Also re-enqueue any tasks that were running on the node
+  val executorFailureReason = reason match {
+case exited: ExecutorExitedNormally =>
--- End diff --

This would go away if you follow my suggestion of merging the two errors, 
but in any case, since it's a case class:

case ExecutorExitedNormally(exitCode, reason) =>




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37353655
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -739,6 +740,21 @@ private[spark] class TaskSetManager(
 maybeFinishTaskSet()
   }
 
+
+  /**
+   * Determine if this task failure reason should count towards failing 
the job. All tasks which
+   * end prematurely should be counted as failures that should kill a job 
except:
+   *
+   * 1. The task fails because its attempt to commit was denied, 
preventing spurious stage failures
+   *in cases where many speculative tasks are launched and denied to 
commit, and
+   *
+   * 2. A task failed because its executor exited normally before 
completing the task. For example,
+   *the cluster manager may have asked the executor to release its 
resources and shut down.
+   */
+  private def shouldTaskFailureEventuallyFailJob(reason: TaskEndReason): 
Boolean = {
--- End diff --

Feels like this should be a property of `TaskEndReason`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8283#issuecomment-132351442
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8283#issuecomment-132351447
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41154/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8283#issuecomment-132351095
  
  [Test build #41154 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41154/console)
 for   PR 8283 at commit 
[`be061b3`](https://github.com/apache/spark/commit/be061b3da928d645e2029ef37ac661a4cb84bb24).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8286#issuecomment-132351145
  
  [Test build #41164 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41164/consoleFull)
 for   PR 8286 at commit 
[`e547fe8`](https://github.com/apache/spark/commit/e547fe80f59f83fe2b3934215975f9180c5da164).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37353473
  
--- Diff: core/src/main/scala/org/apache/spark/TaskEndReason.scala ---
@@ -208,6 +208,22 @@ case class ExecutorLostFailure(execId: String) extends 
TaskFailedReason {
 
 /**
  * :: DeveloperApi ::
+ * The task failed because the executor that it was running on was 
prematurely terminated. The
+ * executor is forcibly exited but the exit should be considered as part 
of normal cluster
+ * behavior.
+ */
+@DeveloperApi
+case class ExecutorNormalExit(
--- End diff --

I'd give the same feedback here, but then `ExecutorLostFailure` is a 
developer API... still I think that a single reason (with a boolean saying 
whether to treat it as an error) would be simpler.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...

2015-08-18 Thread srowen

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/8286#issuecomment-132350637
  
+1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8184#issuecomment-132350414
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41163/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8184#issuecomment-132350411
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

2015-08-18 Thread MechCoder

Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/8197#issuecomment-132350267
  
Okay. But he had to told something else in that PR discussion :p. I do
agree that doing model.binarySUmmary is much nearer than
model.asInstanceOf[]â¦
On Aug 19, 2015 2:27 AM, "UCB AMPLab"  wrote:

> Merged build finished. Test PASSed.
>
> â
> Reply to this email directly or view it on GitHub
> .
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8184#issuecomment-132350174
  
  [Test build #41163 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41163/console)
 for   PR 8184 at commit 
[`dfb3b2f`](https://github.com/apache/spark/commit/dfb3b2ffe8928142d8e1e96c9a45968056d2336d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8286#issuecomment-132349838
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8286#issuecomment-132349770
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8007#discussion_r37353128
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ExecutorLossReason.scala ---
@@ -23,13 +23,29 @@ import org.apache.spark.executor.ExecutorExitCode
  * Represents an explanation for a executor or whole slave failing or 
exiting.
  */
 private[spark]
-class ExecutorLossReason(val message: String) {
+class ExecutorLossReason(val message: String) extends Serializable {
   override def toString: String = message
 }
 
+private[spark] case class ExecutorExitedAbnormally(val exitCode: Int, 
reason: String)
+  extends ExecutorLossReason(reason) {
+}
+
+private[spark] object ExecutorExitedAbnormally {
+  def apply(exitCode: Int): ExecutorExitedAbnormally = {
+ExecutorExitedAbnormally(exitCode, 
ExecutorExitCode.explainExitCode(exitCode))
+  }
+}
+
 private[spark]
-case class ExecutorExited(val exitCode: Int)
-  extends ExecutorLossReason(ExecutorExitCode.explainExitCode(exitCode)) {
+case class ExecutorExitedNormally(val exitCode: Int, reason: String)
+  extends ExecutorLossReason(reason) {
+}
+
+private[spark] object ExecutorExitedNormally {
--- End diff --

I don't know, I find `ExecutorExitedAbnormally` and 
`ExecutorExitedNormally` a little confusing, since internally they hold exactly 
the same data (even the same reason message). What if there was only 
`ExecutorExited` with a parameter saying whether it should be treated as an 
error or not?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...

2015-08-18 Thread davies

GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/8286

[SPARK-10095] [SQL] use public API of BigInteger

In UnsafeRow, we use the private field of BigInteger for better 
performance, but it actually didn't contribute much (3% in one benchmark) to 
end-to-end runtime, and make it not portable (may fail on other JVM 
implementations).

So we should use the public API instead.

cc @rxin 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark portable_decimal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8286.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8286


commit e547fe80f59f83fe2b3934215975f9180c5da164
Author: Davies Liu 
Date:   2015-08-18T20:59:58Z

use public API of BigInteger




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...

2015-08-18 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/8241#discussion_r37352728
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -83,9 +86,12 @@ class NaiveBayesModel private[spark] (
   }
 
   override def predict(testData: RDD[Vector]): RDD[Double] = {
-val bcModel = testData.context.broadcast(this)
+bcModel match {
+  case None => bcModel = Some(testData.context.broadcast(this))
+  case _ =>
+}
 testData.mapPartitions { iter =>
-  val model = bcModel.value
+  val model = bcModel.get.value
--- End diff --

I believe we will still want to make a local reference to bcModel here so 
we don't end up shipping the entire object across the wire.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8007#issuecomment-132348292
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41152/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8007#issuecomment-132348290
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8007#issuecomment-132348217
  
  [Test build #41152 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41152/console)
 for   PR 8007 at commit 
[`70d6a15`](https://github.com/apache/spark/commit/70d6a1587906210ea4451fb1743b8eda6e7b90c4).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class ExecutorNormalExit(`
  * `class ExecutorLossReason(val message: String) extends Serializable `
  * `case class ExecutorExitedNormally(val exitCode: Int, reason: String)`
  * `  case class RemoveExecutor(executorId: String, reason: 
ExecutorLossReason)`
  * `  case class AcknowledgeExecutorRemoved(executorId: String) extends 
CoarseGrainedClusterMessage`
  * `  case class GetExecutorLossReason(executorId: String) extends 
CoarseGrainedClusterMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8197#issuecomment-132348205
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41162/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8197#issuecomment-132348201
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8197#issuecomment-132348051
  
  [Test build #41162 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41162/console)
 for   PR 8197 at commit 
[`7bf922c`](https://github.com/apache/spark/commit/7bf922c53b0e7f6e6d5304107f432b58ad7b93c7).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9833] [yarn] Add options to disable del...

Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/8134#issuecomment-132347459
  
@tgravescs I chose a slightly different name than you suggested, how does 
that sound?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8281


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...

2015-08-18 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/8281#issuecomment-132347116
  
Merged to master and 1.5


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8280#issuecomment-132346850
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41149/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8280#issuecomment-132346849
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8280#issuecomment-132346765
  
  [Test build #41149 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41149/console)
 for   PR 8280 at commit 
[`f77e574`](https://github.com/apache/spark/commit/f77e574dd749c0c140ee71e4aaa143abbfcc6d56).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8184#issuecomment-132346686
  
  [Test build #41163 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41163/consoleFull)
 for   PR 8184 at commit 
[`dfb3b2f`](https://github.com/apache/spark/commit/dfb3b2ffe8928142d8e1e96c9a45968056d2336d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8505][SparkR] Add settings to kick `lin...

2015-08-18 Thread shivaram

Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/7883#issuecomment-132345408
  
welcome back @shaneknapp !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...

2015-08-18 Thread shivaram

Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/8280#issuecomment-132345128
  
The diff I'm proposing is something like 
```
+val numShuffleDeps = 
rdd.dependencies.filter(_.isInstanceOf[ShuffleDependency[_, _, _]]).length
+
 // If the RDD has shuffle dependencies and shuffle locality is 
enabled, pick locations that
 // have at least REDUCER_PREF_LOCS_FRACTION of data as preferred 
locations
-if (shuffleLocalityEnabled && rdd.partitions.length < 
SHUFFLE_PREF_REDUCE_THRESHOLD) {
+if (numShuffleDeps == 1 && shuffleLocalityEnabled &&
+rdd.partitions.length < SHUFFLE_PREF_REDUCE_THRESHOLD) {
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9772] [PySpark] [ML] Add Python API for...

Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8102#issuecomment-132345003
  
One more comment: need to add VectorSlicer to list ```__all__``` at top of 
file


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9772] [PySpark] [ML] Add Python API for...

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8102#discussion_r37350810
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -950,6 +950,92 @@ class VectorIndexerModel(JavaModel):
 
 
 @inherit_doc
+class VectorSlicer(JavaTransformer, HasInputCol, HasOutputCol):
+"""
+.. note:: Experimental
+
+This class takes a feature vector and outputs a new feature vector 
with a subarray
+of the original features.
+
+The subset of features can be specified with either indices 
(`setIndices()`)
+or names (`setNames()`).  At least one feature must be selected. 
Duplicate features
+are not allowed, so there can be no overlap between selected indices 
and names.
+
+The output vector will order features with the selected indices first 
(in the order given),
+followed by the selected names (in the order given).
+
+>>> from pyspark.mllib.linalg import Vectors
+>>> df = sqlContext.createDataFrame([
+... (Vectors.dense([-2.0, 2.3, 0.0, 0.0, 1.0]),),
+... (Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0]),),
+... (Vectors.dense([0.6, -1.1, -3.0, 4.5, 3.3]),)], ["features"])
+>>> vs = VectorSlicer(inputCol="features", outputCol="expected", 
indices=[1, 4])
+>>> vs.transform(df).head().expected
+DenseVector([2.3, 1.0])
+"""
+
+# a placeholder to make it appear in the generated doc
+indices = Param(Params._dummy(), "indices", "An array of indices to 
select features from " +
+"a vector column. There can be no overlap with 
`names`.")
+names = Param(Params._dummy(), "names", "An array of feature names to 
select features from " +
+  "a vector column. These names must be specified by ML " +
+  "`org.apache.spark.ml.attribute.Attribute`s. There can 
be no overlap with " +
+  "`indices`.")
+
+@keyword_only
+def __init__(self, inputCol=None, outputCol=None, indices=[], 
names=[]):
--- End diff --

In Python, we should avoid using mutable values ```[]``` as defaults.  
Let's use None.  The Scala API should take care of the defaults.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8184#issuecomment-132344654
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9772] [PySpark] [ML] Add Python API for...

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8102#discussion_r37350798
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -950,6 +950,92 @@ class VectorIndexerModel(JavaModel):
 
 
 @inherit_doc
+class VectorSlicer(JavaTransformer, HasInputCol, HasOutputCol):
+"""
+.. note:: Experimental
+
+This class takes a feature vector and outputs a new feature vector 
with a subarray
+of the original features.
+
+The subset of features can be specified with either indices 
(`setIndices()`)
+or names (`setNames()`).  At least one feature must be selected. 
Duplicate features
+are not allowed, so there can be no overlap between selected indices 
and names.
+
+The output vector will order features with the selected indices first 
(in the order given),
+followed by the selected names (in the order given).
+
+>>> from pyspark.mllib.linalg import Vectors
+>>> df = sqlContext.createDataFrame([
+... (Vectors.dense([-2.0, 2.3, 0.0, 0.0, 1.0]),),
+... (Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0]),),
+... (Vectors.dense([0.6, -1.1, -3.0, 4.5, 3.3]),)], ["features"])
+>>> vs = VectorSlicer(inputCol="features", outputCol="expected", 
indices=[1, 4])
--- End diff --

Rename "expected" to "sliced" since this is an example


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8184#issuecomment-132344710
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9772] [PySpark] [ML] Add Python API for...

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8102#discussion_r37350812
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -950,6 +950,92 @@ class VectorIndexerModel(JavaModel):
 
 
 @inherit_doc
+class VectorSlicer(JavaTransformer, HasInputCol, HasOutputCol):
+"""
+.. note:: Experimental
+
+This class takes a feature vector and outputs a new feature vector 
with a subarray
+of the original features.
+
+The subset of features can be specified with either indices 
(`setIndices()`)
+or names (`setNames()`).  At least one feature must be selected. 
Duplicate features
+are not allowed, so there can be no overlap between selected indices 
and names.
+
+The output vector will order features with the selected indices first 
(in the order given),
+followed by the selected names (in the order given).
+
+>>> from pyspark.mllib.linalg import Vectors
+>>> df = sqlContext.createDataFrame([
+... (Vectors.dense([-2.0, 2.3, 0.0, 0.0, 1.0]),),
+... (Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0]),),
+... (Vectors.dense([0.6, -1.1, -3.0, 4.5, 3.3]),)], ["features"])
+>>> vs = VectorSlicer(inputCol="features", outputCol="expected", 
indices=[1, 4])
+>>> vs.transform(df).head().expected
+DenseVector([2.3, 1.0])
+"""
+
+# a placeholder to make it appear in the generated doc
+indices = Param(Params._dummy(), "indices", "An array of indices to 
select features from " +
+"a vector column. There can be no overlap with 
`names`.")
+names = Param(Params._dummy(), "names", "An array of feature names to 
select features from " +
+  "a vector column. These names must be specified by ML " +
+  "`org.apache.spark.ml.attribute.Attribute`s. There can 
be no overlap with " +
+  "`indices`.")
+
+@keyword_only
+def __init__(self, inputCol=None, outputCol=None, indices=[], 
names=[]):
+"""
+__init__(self, inputCol=None, outputCol=None, indices=[], names=[])
+"""
+super(VectorSlicer, self).__init__()
+self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorSlicer", self.uid)
+self.indices = Param(self, "indices", "An array of indices to 
select features from " +
+ "a vector column. There can be no overlap 
with `names`.")
+self.names = Param(self, "names", "An array of feature names to 
select features from " +
+   "a vector column. These names must be specified 
by ML " +
+   "`org.apache.spark.ml.attribute.Attribute`s. 
There can be no overlap " +
+   "with `indices`.")
+self._setDefault(indices=[], names=[])
--- End diff --

so no defaults here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9772] [PySpark] [ML] Add Python API for...

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8102#discussion_r37350805
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -950,6 +950,92 @@ class VectorIndexerModel(JavaModel):
 
 
 @inherit_doc
+class VectorSlicer(JavaTransformer, HasInputCol, HasOutputCol):
+"""
+.. note:: Experimental
+
+This class takes a feature vector and outputs a new feature vector 
with a subarray
+of the original features.
+
+The subset of features can be specified with either indices 
(`setIndices()`)
+or names (`setNames()`).  At least one feature must be selected. 
Duplicate features
+are not allowed, so there can be no overlap between selected indices 
and names.
+
+The output vector will order features with the selected indices first 
(in the order given),
+followed by the selected names (in the order given).
+
+>>> from pyspark.mllib.linalg import Vectors
+>>> df = sqlContext.createDataFrame([
+... (Vectors.dense([-2.0, 2.3, 0.0, 0.0, 1.0]),),
+... (Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0]),),
+... (Vectors.dense([0.6, -1.1, -3.0, 4.5, 3.3]),)], ["features"])
+>>> vs = VectorSlicer(inputCol="features", outputCol="expected", 
indices=[1, 4])
+>>> vs.transform(df).head().expected
+DenseVector([2.3, 1.0])
+"""
+
+# a placeholder to make it appear in the generated doc
+indices = Param(Params._dummy(), "indices", "An array of indices to 
select features from " +
+"a vector column. There can be no overlap with 
`names`.")
+names = Param(Params._dummy(), "names", "An array of feature names to 
select features from " +
+  "a vector column. These names must be specified by ML " +
+  "`org.apache.spark.ml.attribute.Attribute`s. There can 
be no overlap with " +
--- End diff --

No need for backtick within strings; the backtick is only for formatting 
generated API docs.  (here and elsewhere)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8281#issuecomment-132344114
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8197#issuecomment-132344168
  
  [Test build #41162 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41162/consoleFull)
 for   PR 8197 at commit 
[`7bf922c`](https://github.com/apache/spark/commit/7bf922c53b0e7f6e6d5304107f432b58ad7b93c7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8285#issuecomment-132344077
  
  [Test build #41161 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41161/consoleFull)
 for   PR 8285 at commit 
[`e8b8240`](https://github.com/apache/spark/commit/e8b8240d389782bfc0e75cbe1797ce5aecc47092).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8281#issuecomment-132344119
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41150/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8281#issuecomment-132343931
  
  [Test build #41150 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41150/console)
 for   PR 8281 at commit 
[`541d9a0`](https://github.com/apache/spark/commit/541d9a016b125a3fbbef5cdf97ee3ff9db78b8a0).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `implicit class StringToColumn(val sc: StringContext) `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8505][SparkR] Add settings to kick `lin...

2015-08-18 Thread shaneknapp

Github user shaneknapp commented on the pull request:

https://github.com/apache/spark/pull/7883#issuecomment-132343883
  
found one directory on amp-jenkins-worker-01 that's polluted -- deleting it
now, and this should fix any builds that run there.

On Mon, Aug 17, 2015 at 9:36 PM, shane knapp â   
wrote:

> On Mon, Aug 17, 2015 at 10:11 AM, Shivaram Venkataraman <
> notificati...@github.com> wrote:
>
>> @JoshRosen  There seems to be some problem
>> on some of the Jenkins workers and we get errors which look like
>>
>> running git clean -fdx
>> warning: failed to remove 'target/'
>> Removing target/
>> Build step 'Execute shell' marked build as failure
>>
>> I've seen this in other PRs as well -- Any ideas what is causing this ?
>>
>>
>> somehow the spark builds are creating directories w/the wrong permissions
> (missing the owner write bit), meaning that the directory created from a
> previous build can't be deleted and thereby fails the build.
>
> i'll go through all of the workers/spark build dirs first thing tomorrow
> and fix this.
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8197#issuecomment-132343523
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8197#issuecomment-132343547
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8285#issuecomment-132343518
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9574][Streaming]Remove unnecessary cont...

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8069


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8285#issuecomment-132343525
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9574][Streaming]Remove unnecessary cont...

2015-08-18 Thread tdas

Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/8069#issuecomment-132343048
  
I am merging this to master and 1.5, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...