[GitHub] spark pull request: [SPARK-10048][SPARKR] Support arbitrary nested...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/8276#issuecomment-132297061 Thanks @sun-rui I'll take a look at this today cc @davies --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE] Disable spark.shuffle.reduceLocality.en...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132305977 Merged build started.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132307731 Cool - and once again, trying the new commit would be appreciated. Also @markgrover how do we want to resolve all of the duplicate work being done here and in #8093? Should we try to merge this commit first and have your commit be rebased on top of this? Or should it be the other way around?
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37336212 --- Diff: docs/ml-linear-methods.md --- @@ -118,12 +133,114 @@ lrModel = lr.fit(training) print("Weights: " + str(lrModel.weights)) print("Intercept: " + str(lrModel.intercept)) {% endhighlight %} +</div> </div> +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `Datafram`s in --- End diff -- Whoops, yep that's right
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/8184#discussion_r37337185 --- Diff: docs/ml-features.md --- @@ -649,6 +649,80 @@ for expanded in polyDF.select("polyFeatures").take(3): </div> </div> +## Discrete Cosine Transform (DCT) + +The [Discrete Cosine +Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform) +transforms a length $N$ real-valued sequence in the time domain into +another length $N$ real-valued sequence in the frequency domain. A +[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class +provides this functionality, implementing the +[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II) +and scaling the result by $1/\sqrt{2}$ such that the representing matrix +for the transform is unitary. No shift is applied to the transformed +sequence (e.g. the $0$th element of the transformed sequence is the +$0$th DCT coefficient and _not_ the $N/2$th). + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> +{% highlight scala %} +import org.apache.spark.ml.feature.DCT +import org.apache.spark.mllib.linalg.Vectors + +val data = Seq( + Vectors.dense(0.0, 1.0, -2.0, 3.0), + Vectors.dense(-1.0, 2.0, 4.0, -7.0), + Vectors.dense(14.0, -2.0, -5.0, 1.0)) +val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") +val dct = new DCT() + .setInputCol("features") + .setOutputCol("featuresDCT") + .setInverse(false) +val dctDf = dct.transform(df) +dctDf.select("featuresDCT").take(3).foreach(println) +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.ml.feature.DCT; +import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.VectorUDT; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +JavaRDD<Row> data = jsc.parallelize(Arrays.asList( + RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)), --- End diff -- 2 space indentation
[GitHub] spark pull request: [SPARK-9952] Fix N^2 loop when DAGScheduler.ge...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8178#issuecomment-132319392 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41138/ Test FAILed.
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132320481 cc @shivaram
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132320541 [Test build #41155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41155/consoleFull) for PR 8271 at commit [`cf7f509`](https://github.com/apache/spark/commit/cf7f5091174247ea945b8b4ae01eb02f64a07711).
[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8283#issuecomment-132320531 [Test build #41154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41154/consoleFull) for PR 8283 at commit [`be061b3`](https://github.com/apache/spark/commit/be061b3da928d645e2029ef37ac661a4cb84bb24).
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132324006 The reduce stage has a 2-way join in it. The two map stages had 30 and 1 tasks, respectively. For the stage having 30 tasks, here is the screenshot of task info ![image](https://cloud.githubusercontent.com/assets/2072857/9340299/c3c52c7a-45a3-11e5-8ee8-425fcd44612c.png) For the stage having 1 task, here is the screenshot of task info ![image](https://cloud.githubusercontent.com/assets/2072857/9340324/e332f010-45a3-11e5-97e9-40adb5461975.png)
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132328891 ah, sorry i missed the reducer stage's screenshot. Yes, executor 23 was the one that got all the reduce tasks. ![image](https://cloud.githubusercontent.com/assets/2072857/9340710/48fbb2a4-45a6-11e5-87fc-41d8b41ef6a6.png)
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132331662 So my hypothesis right now is that the RDD in the reduce stage has two shuffle dependencies, and the first shuffle dependency happens to be the single-map-task stage -- so the locality preference ends up giving all the tasks to the single host. My guess is that ideally we need to be able to differentiate among different shuffle dependencies. Here is another suggestion: can we turn this off if we have more than one shuffle dependency? It should be pretty cheap to count that.
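To make the failure mode above concrete, here is a simplified, hypothetical model of a reduce-locality preference -- not Spark's actual `DAGScheduler` code, and the class name and 0.8 threshold are illustrative assumptions. A reducer prefers a host only when that host holds a large fraction of a dependency's map output, so a single-task map stage (all output on one host) makes every reduce task prefer that one host:

```java
import java.util.Map;
import java.util.Optional;

public class ReduceLocalityModel {
    // Hypothetical heuristic: prefer a host only if it holds at least
    // `threshold` of the total map output bytes for one shuffle dependency.
    static Optional<String> preferredHost(Map<String, Long> bytesByHost, double threshold) {
        long total = bytesByHost.values().stream().mapToLong(Long::longValue).sum();
        return bytesByHost.entrySet().stream()
                .filter(e -> (double) e.getValue() / total >= threshold)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // A 30-task map stage spreads its output: no host passes the threshold,
        // so reduce tasks have no locality preference and spread out.
        Map<String, Long> wideStage = Map.of("host1", 10L, "host2", 11L, "host3", 9L);
        // A 1-task map stage puts everything on one host, so every reduce task
        // that consults this dependency prefers that single host.
        Map<String, Long> singleTaskStage = Map.of("host1", 5L);
        System.out.println(preferredHost(wideStage, 0.8));       // Optional.empty
        System.out.println(preferredHost(singleTaskStage, 0.8)); // Optional[host1]
    }
}
```

Under this model, counting the shuffle dependencies (and skipping the preference when there is more than one) would avoid letting a trivially small dependency dominate placement, which is the cheap check suggested above.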
[GitHub] spark pull request: [SPARK-9439] [yarn] External shuffle service r...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/7943#issuecomment-132333169 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-10085] [MLlib] [Docs] removed unnecessa...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/8284#issuecomment-132333113 LGTM. Merged into master and branch-1.5. Thanks!
[GitHub] spark pull request: [SPARK-10085] [MLlib] [Docs] removed unnecessa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8284
[GitHub] spark pull request: [SPARK-10085] [MLlib] [Docs] removed unnecessa...
Github user stared commented on the pull request: https://github.com/apache/spark/pull/8284#issuecomment-132333595 Wow, it was quick! Thanks!
[GitHub] spark pull request: [SPARK-9893] user guide for VectorSlicer
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8267#discussion_r37346802 --- Diff: docs/ml-features.md --- @@ -1389,3 +1389,145 @@ print(output.select("features", "clicked").first()) # Feature Selectors +## VectorSlicer + +`VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column. + +`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column whose values are selected via those indices. There are two types of indices, + + 1. Integer indices that represents the real indices in the vector, `setIndices()`; + + 2. String indices that represents the names of features in the vector, `setNames()`. + +Specify by integer and string are both acceptable, moreover, you can use integer index and string name simultaneously. At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names. Note that if names of features are selected, an exception will be threw out when encountering with empty input attributes. --- End diff -- nit: ***Specification*** by integer and string are both acceptable***. M***oreover,
[GitHub] spark pull request: [SPARK-9439] [yarn] External shuffle service r...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7943#issuecomment-132335105 [Test build #41158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41158/consoleFull) for PR 7943 at commit [`0d285d3`](https://github.com/apache/spark/commit/0d285d3fac15afc77313255799a3392dcf74518f).
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8184#discussion_r37349464 --- Diff: docs/ml-features.md --- @@ -649,6 +649,80 @@ for expanded in polyDF.select("polyFeatures").take(3): </div> </div> +## Discrete Cosine Transform (DCT) + +The [Discrete Cosine +Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform) +transforms a length $N$ real-valued sequence in the time domain into +another length $N$ real-valued sequence in the frequency domain. A +[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class +provides this functionality, implementing the +[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II) +and scaling the result by $1/\sqrt{2}$ such that the representing matrix +for the transform is unitary. No shift is applied to the transformed +sequence (e.g. the $0$th element of the transformed sequence is the +$0$th DCT coefficient and _not_ the $N/2$th). + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> +{% highlight scala %} +import org.apache.spark.ml.feature.DCT +import org.apache.spark.mllib.linalg.Vectors + +val data = Seq( + Vectors.dense(0.0, 1.0, -2.0, 3.0), + Vectors.dense(-1.0, 2.0, 4.0, -7.0), + Vectors.dense(14.0, -2.0, -5.0, 1.0)) +val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") +val dct = new DCT() + .setInputCol("features") + .setOutputCol("featuresDCT") + .setInverse(false) +val dctDf = dct.transform(df) +dctDf.select("featuresDCT").take(3).foreach(println) +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.ml.feature.DCT; +import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.VectorUDT; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +JavaRDD<Row> data = jsc.parallelize(Arrays.asList( + RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)), --- End diff -- OK
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8197#issuecomment-132343523 Merged build triggered.
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132346850 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41149/ Test FAILed.
[GitHub] spark pull request: [SPARK-10088] [sql] Add support for stored as...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8282#issuecomment-132354966 [Test build #41153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41153/console) for PR 8282 at commit [`2256b43`](https://github.com/apache/spark/commit/2256b430bd7e98ca0bc92dc74bdf7340f9d134cf). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37358327 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala --- @@ -91,6 +92,66 @@ private[spark] abstract class YarnSchedulerBackend( } /** + * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected. + * We should check the cluster manager and find if the loss of the executor was caused by YARN + * force killing it due to preemption. + */ + private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)]) + extends DriverEndpoint(rpcEnv, sparkProperties) { + +private val pendingDisconnectedExecutors = new HashSet[String] +private val handleDisconnectedExecutorThreadPool = + ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool") +/** + * When onDisconnected is received at the driver endpoint, the superclass DriverEndpoint + * handles it by assuming the Executor was lost for a bad reason and removes the executor + * immediately. + * + * In YARN's case however it is crucial to talk to the application master and ask why the + * executor had exited. In particular, the executor may have exited due to the executor + * having been preempted. If the executor exited normally according to the application + * master then we pass that information down to the TaskSetManager to inform the + * TaskSetManager that tasks on that lost executor should not count towards a job failure. + */ +override def onDisconnected(rpcAddress: RpcAddress): Unit = { --- End diff -- To @markgrover's question, yes, by overriding the method then only this implementation will be invoked.
[GitHub] spark pull request: [SPARK-8918] [MLLIB] [DOC] Add @since tags to ...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/8288 [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering This continues the work from #8256. I removed `@since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659). @MechCoder Closes #8256 You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark SPARK-8918 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8288.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8288 commit 6bcb09ba21bf5af6a5662e719abd0eb488c3a6c1 Author: Xiaoqing Wang spark...@126.com Date: 2015-08-17T23:02:26Z SPARK-8918 Add @since tags to mllib.clustering commit c679b975cf2fe5d463bee4ca665dced7504af107 Author: Xiaoqing Wang spark...@126.com Date: 2015-08-18T13:55:58Z update code style, tab replaced by blank commit e430de9fa576ae8752f7d90b342026ff84fab8b5 Author: Xiaoqing Wang spark...@126.com Date: 2015-08-18T14:45:41Z update code style: delete the whitespace at end of line commit e94968ad540c3e7cd15d7e2015d6705503e459e1 Author: Xiangrui Meng m...@databricks.com Date: 2015-08-18T22:00:54Z Merge remote-tracking branch 'apache/master' into XiaoqingWang-SPARK-8918 commit 72fdeb64630470f6f46cf3eed8ffbfe83a7c4659 Author: Xiangrui Meng m...@databricks.com Date: 2015-08-18T22:05:01Z remove since tags from private vars
[GitHub] spark pull request: [SPARK-8918] [MLLIB] [DOC] Add @since tags to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8288#issuecomment-132371671 Merged build triggered.
[GitHub] spark pull request: [SPARK-10090] [SQL] fix decimal scale of divis...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/8287 [SPARK-10090] [SQL] fix decimal scale of division In TPCDS Q59, the result should be DecimalType(37, 20), but got Decimal('0.69903637110664268591656984574863203607'), should be Decimal('0.69903637110664268592'). TODO: add regression tests (we have low coverage for DecimalType in Cast) You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark decimal_division Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8287.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8287 commit 3d5f0911d6df431273fd2d3753df9c4241a9107c Author: Davies Liu dav...@databricks.com Date: 2015-08-18T22:09:52Z fix decimal precision of division
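The scale mismatch described above can be sketched with plain `java.math.BigDecimal`; the operands below are made up for illustration and are not the actual TPCDS Q59 values. A division carried out at high internal precision keeps extra fractional digits, and a result destined for a `DecimalType(37, 20)` column must then be rounded to scale 20 rather than returned verbatim:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalScaleDemo {
    public static void main(String[] args) {
        // Illustrative operands (not the actual Q59 data).
        BigDecimal numerator = new BigDecimal("7");
        BigDecimal denominator = new BigDecimal("9");

        // Dividing at a high internal scale keeps 38 fractional digits...
        BigDecimal raw = numerator.divide(denominator, 38, RoundingMode.HALF_UP);
        // ...but a DecimalType(37, 20) result must be rounded to scale 20.
        BigDecimal fitted = raw.setScale(20, RoundingMode.HALF_UP);

        System.out.println(raw);    // 0.77777777777777777777777777777777777778
        System.out.println(fitted); // 0.77777777777777777778
    }
}
```

Returning `raw` instead of `fitted` is exactly the shape of the bug in the PR description: a 38-digit fraction where a 20-digit one was declared.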
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37345734 --- Diff: docs/ml-linear-methods.md --- @@ -118,12 +133,114 @@ lrModel = lr.fit(training) print(Weights: + str(lrModel.weights)) print(Intercept: + str(lrModel.intercept)) {% endhighlight %} +/div /div +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `Datafram`s in +`BinaryLogisticRegressionSummary` are annoted `@transient` and hence +only available on the driver. + +div class=codetabs + +div data-lang=scala markdown=1 + +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary) +provides a summary for a +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% highlight scala %} +// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example +val trainingSummary = lrModel.summary + +// Obtain the loss per iteration. +val objectiveHistory = trainingSummary.objectiveHistory +objectiveHistory.foreach(loss = println(loss)) + +// Obtain the metrics useful to judge performance on test data. +// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a +// binary classification problem. +val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary] + +// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC. 
+val roc = binarySummary.roc +roc.show() +roc.select(FPR).show() +println(binarySummary.areaUnderROC) + +// Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression with +// this selected threshold. +val fMeasure = binarySummary.fMeasureByThreshold +val maxFMeasure = fMeasure.select(max(F-Measure)).head().getDouble(0) +val bestThreshold = fMeasure.where($F-Measure === maxFMeasure). + select(threshold).head().getDouble(0) +logReg.setThreshold(bestThreshold) +logReg.fit(logRegDataFrame) +{% endhighlight %} /div -### Optimization +div data-lang=java markdown=1 +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html) +provides a summary for a +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% highlight java %} +// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example +LogisticRegressionTrainingSummary trainingSummary = logRegModel.summary(); + +// Obtain the loss per iteration. +double[] objectiveHistory = trainingSummary.objectiveHistory(); +for (double lossPerIteration : objectiveHistory) { --- End diff -- Nope, see [Google's Java Style Guide](https://google.github.io/styleguide/javaguide.html#s4.6.2-horizontal-whitespace) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
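The threshold-selection step in the quoted Scala snippet is just an argmax over (threshold, F-measure) rows. A self-contained sketch of that logic in plain Scala (the sample rows below are invented for illustration; in the guide the rows come from `binarySummary.fMeasureByThreshold`, a DataFrame computed over the training set):

```scala
object ThresholdSelection {
  // Given (threshold, F-measure) pairs, return the threshold whose
  // F-measure is maximal -- the same argmax the guide performs with
  // max("F-Measure") followed by a where(...) filter.
  def bestThreshold(fMeasureByThreshold: Seq[(Double, Double)]): Double =
    fMeasureByThreshold.maxBy(_._2)._1

  // Invented sample rows, for illustration only.
  val sampleRows = Seq((0.1, 0.62), (0.3, 0.81), (0.5, 0.77), (0.7, 0.54))
}
```

Doing the argmax in one pass with `maxBy` also avoids the two-step max-then-filter dance, though on a distributed DataFrame the two-step form is what the API offers.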
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37345587 --- Diff: docs/ml-linear-methods.md --- @@ -118,12 +133,114 @@ lrModel = lr.fit(training) print("Weights: " + str(lrModel.weights)) print("Intercept: " + str(lrModel.intercept)) {% endhighlight %} +</div> </div> +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `DataFrame`s in +`BinaryLogisticRegressionSummary` are annotated `@transient` and hence +only available on the driver. + +<div class="codetabs"> + +<div data-lang="scala" markdown="1"> + +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary) +provides a summary for a +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary). +This will likely change when multiclass classification is supported. --- End diff -- Downcasting is almost always an indication of a poor abstraction and IMO the stabilized API should not require any explicit typecasting by the end user, [here's an explanation](http://codebetter.com/jeremymiller/2006/12/26/downcasting-is-a-code-smell/)
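The difference the reviewer is pointing at can be seen with toy classes (illustrative stand-ins, not Spark's actual API): an `asInstanceOf` downcast throws `ClassCastException` if the runtime type ever changes, whereas a pattern match forces the caller to handle the non-binary case explicitly.

```scala
// Toy stand-ins for a summary hierarchy, to contrast cast vs. pattern match.
trait Summary
class BinarySummary extends Summary {
  def areaUnderROC: Double = 0.91 // placeholder metric value
}

object CastDemo {
  val summary: Summary = new BinarySummary

  // Style the guide documents: an unchecked downcast.
  val viaCast: Double = summary.asInstanceOf[BinarySummary].areaUnderROC

  // Pattern match: the "not binary" case is represented as None
  // instead of a runtime exception.
  val viaMatch: Option[Double] = summary match {
    case b: BinarySummary => Some(b.areaUnderROC)
    case _                => None
  }
}
```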
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132332983 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41155/ Test PASSed.
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132332979 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10001] [CORE] Allow Ctrl-C in spark-she...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8216#issuecomment-132339281 [Test build #41148 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41148/console) for PR 8216 at commit [`d3eabf0`](https://github.com/apache/spark/commit/d3eabf026fc4806414131833435e1fd0e868957a). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9772] [PySpark] [ML] Add Python API for...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8102#issuecomment-132345003 One more comment: need to add VectorSlicer to list ```__all__``` at top of file
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132346765 [Test build #41149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41149/console) for PR 8280 at commit [`f77e574`](https://github.com/apache/spark/commit/f77e574dd749c0c140ee71e4aaa143abbfcc6d56). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8184#issuecomment-132346686 [Test build #41163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41163/consoleFull) for PR 8184 at commit [`dfb3b2f`](https://github.com/apache/spark/commit/dfb3b2ffe8928142d8e1e96c9a45968056d2336d).
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132348290 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132348292 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41152/ Test FAILed.
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8197#issuecomment-132348205 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41162/ Test PASSed.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132348217 [Test build #41152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41152/console) for PR 8007 at commit [`70d6a15`](https://github.com/apache/spark/commit/70d6a1587906210ea4451fb1743b8eda6e7b90c4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class ExecutorNormalExit(` * `class ExecutorLossReason(val message: String) extends Serializable ` * `case class ExecutorExitedNormally(val exitCode: Int, reason: String)` * ` case class RemoveExecutor(executorId: String, reason: ExecutorLossReason)` * ` case class AcknowledgeExecutorRemoved(executorId: String) extends CoarseGrainedClusterMessage` * ` case class GetExecutorLossReason(executorId: String) extends CoarseGrainedClusterMessage`
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8197#issuecomment-132348201 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/8286#issuecomment-132350637 +1
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37353473 --- Diff: core/src/main/scala/org/apache/spark/TaskEndReason.scala --- @@ -208,6 +208,22 @@ case class ExecutorLostFailure(execId: String) extends TaskFailedReason { /** * :: DeveloperApi :: + * The task failed because the executor that it was running on was prematurely terminated. The + * executor is forcibly exited but the exit should be considered as part of normal cluster + * behavior. + */ +@DeveloperApi +case class ExecutorNormalExit( --- End diff -- I'd give the same feedback here, but then `ExecutorLostFailure` is a developer API... still I think that a single reason (with a boolean saying whether to treat it as an error) would be simpler.
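A sketch of the single-reason shape suggested in this comment — one case class carrying a boolean instead of a separate subclass per exit kind. The names and field choices here are illustrative, not Spark's actual API:

```scala
// Hypothetical single-reason design: the flag says whether the exit
// should count against the application's failure limits, so schedulers
// branch on a field rather than on the runtime subclass.
case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean, reason: String)

object ExitHandling {
  def shouldCountAsFailure(e: ExecutorExited): Boolean = e.exitCausedByApp

  // Invented examples: a YARN preemption is "normal", a JVM crash is not.
  val preempted = ExecutorExited(1, exitCausedByApp = false, "container preempted by YARN")
  val crashed   = ExecutorExited(134, exitCausedByApp = true, "JVM crashed")
}
```

The trade-off versus the PR's two-class approach is that pattern matches stay exhaustive over a single type, at the cost of losing a distinct type per exit kind.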
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8184#issuecomment-132350414 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41163/ Test PASSed.
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8286#issuecomment-132351145 [Test build #41164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41164/consoleFull) for PR 8286 at commit [`e547fe8`](https://github.com/apache/spark/commit/e547fe80f59f83fe2b3934215975f9180c5da164).
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132353692 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132353694 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41159/ Test PASSed.
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132353469 [Test build #41159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41159/console) for PR 8271 at commit [`a92c382`](https://github.com/apache/spark/commit/a92c38287c273d82d3e22cf35ebc8216f33d0b2d). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...
Github user sabhyankar commented on the pull request: https://github.com/apache/spark/pull/8241#issuecomment-132362277 Thanks for pointing that out @holdenk ! I have pushed a change to the PR!
[GitHub] spark pull request: [SPARK-10060] [ML] [DOC] spark.ml DecisionTree...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8244#discussion_r37358090 --- Diff: docs/ml-decision-tree.md --- @@ -0,0 +1,506 @@ +--- +layout: global +title: Decision Trees - SparkML +displayTitle: <a href="ml-guide.html">ML</a> - Decision Trees +--- + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + + +# Overview + +[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning) +and their ensembles are popular methods for the machine learning tasks of +classification and regression. Decision trees are widely used since they are easy to interpret, +handle categorical features, extend to the multiclass classification setting, do not require +feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble +algorithms such as random forests and boosting are among the top performers for classification and +regression tasks. + +MLlib supports decision trees for binary and multiclass classification and for regression, +using both continuous and categorical features. The implementation partitions data by rows, +allowing distributed training with millions or even billions of instances. + +Users can find more information about the decision tree algorithm in the [MLlib Decision Tree guide](mllib-decision-tree.html). In this section, we demonstrate the Pipelines API for Decision Trees. + +The Pipelines API for Decision Trees offers a bit more functionality than the original API. In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities). + +Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](ml-ensembles.html). + +# Inputs and Outputs (Predictions) + +We list the input and output (prediction) column types here. +All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.
+ +## Input Columns + +<table class="table"> + <thead> +<tr> + <th align="left">Param name</th> + <th align="left">Type(s)</th> + <th align="left">Default</th> + <th align="left">Description</th> +</tr> + </thead> + <tbody> +<tr> + <td>labelCol</td> + <td>Double</td> + <td>"label"</td> + <td>Label to predict</td> +</tr> +<tr> + <td>featuresCol</td> + <td>Vector</td> + <td>"features"</td> + <td>Feature vector</td> +</tr> + </tbody> +</table> + +## Output Columns + +<table class="table"> + <thead> +<tr> + <th align="left">Param name</th> + <th align="left">Type(s)</th> + <th align="left">Default</th> + <th align="left">Description</th> + <th align="left">Notes</th> +</tr> + </thead> + <tbody> +<tr> + <td>predictionCol</td> + <td>Double</td> + <td>"prediction"</td> + <td>Predicted label</td> + <td></td> +</tr> +<tr> + <td>rawPredictionCol</td> + <td>Vector</td> + <td>"rawPrediction"</td> + <td>Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction</td> + <td>Classification only</td> +</tr> +<tr> + <td>probabilityCol</td> + <td>Vector</td> + <td>"probability"</td> + <td>Vector of length # classes equal to rawPrediction normalized to a multinomial distribution</td> + <td>Classification only</td> +</tr> + </tbody> +</table> + +# Examples + +The below examples demonstrate the Pipelines API for Decision Trees. The main differences between this API and the [original MLlib Decision Tree API](mllib-decision-tree.html) are: + +* support for ML Pipelines +* separation of Decision Trees for classification vs. regression +* use of DataFrame metadata to distinguish continuous and categorical features + + +## Classification + +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set. +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
+ +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier). + +{% highlight scala %} +import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.classification.DecisionTreeClassifier +import
[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8053#issuecomment-132367967 [Test build #41157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41157/console) for PR 8053 at commit [`5be3e44`](https://github.com/apache/spark/commit/5be3e449aa0306c41398408157a7db1cd94f1aa8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8053#issuecomment-132368084 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41157/ Test PASSed.
[GitHub] spark pull request: [SPARK-9439] [yarn] External shuffle service r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7943#issuecomment-132333862 Merged build started.
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37347155 --- Diff: docs/ml-linear-methods.md --- @@ -118,12 +133,114 @@ lrModel = lr.fit(training) print("Weights: " + str(lrModel.weights)) print("Intercept: " + str(lrModel.intercept)) {% endhighlight %} +</div> </div> +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `DataFrame`s in +`BinaryLogisticRegressionSummary` are annotated `@transient` and hence +only available on the driver. + +<div class="codetabs"> + +<div data-lang="scala" markdown="1"> + +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary) +provides a summary for a +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% highlight scala %} +// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example +val trainingSummary = lrModel.summary + +// Obtain the loss per iteration. +val objectiveHistory = trainingSummary.objectiveHistory +objectiveHistory.foreach(loss => println(loss)) + +// Obtain the metrics useful to judge performance on test data. +// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a +// binary classification problem. +val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary] + +// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
+val roc = binarySummary.roc +roc.show() +roc.select("FPR").show() +println(binarySummary.areaUnderROC) + +// Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression with +// this selected threshold. +val fMeasure = binarySummary.fMeasureByThreshold +val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0) +val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure). + select("threshold").head().getDouble(0) +logReg.setThreshold(bestThreshold) +logReg.fit(logRegDataFrame) +{% endhighlight %} </div> -### Optimization +<div data-lang="java" markdown="1"> +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html) +provides a summary for a +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% highlight java %} +// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example
LogisticRegressionTrainingSummary trainingSummary = logRegModel.summary(); + +// Obtain the loss per iteration. +double[] objectiveHistory = trainingSummary.objectiveHistory(); +for (double lossPerIteration : objectiveHistory) { --- End diff -- I see then other places such as this (https://github.com/apache/spark/blob/master/mllib/src/test/java/org/apache/spark/ml/clustering/JavaKMeansSuite.java#L68) have to be changed.
[GitHub] spark pull request: [SPARK-10070] [DOCS] Remove Guava dependencies...
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/8272#issuecomment-132335749 LGTM CC @mengxr
[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8285#issuecomment-132340941 Merged build triggered.
[GitHub] spark pull request: [SPARK-2788] [STREAMING] Add location filterin...
Github user dmvieira commented on the pull request: https://github.com/apache/spark/pull/1717#issuecomment-132340568 I'm starting a third-party package as suggested by @srowen and I hope you enjoy. Feel free to collaborate: https://github.com/dmvieira/spark-twitter-stream-receiver
[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8285#issuecomment-132344077 [Test build #41161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41161/consoleFull) for PR 8285 at commit [`e8b8240`](https://github.com/apache/spark/commit/e8b8240d389782bfc0e75cbe1797ce5aecc47092).
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132344119 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41150/ Test PASSed.
[GitHub] spark pull request: [SPARK-8505][SparkR] Add settings to kick `lin...
Github user shaneknapp commented on the pull request: https://github.com/apache/spark/pull/7883#issuecomment-132343883 found one directory on amp-jenkins-worker-01 that's polluted -- deleting it now, and this should fix any builds that run there. On Mon, Aug 17, 2015 at 9:36 PM, shane knapp ☠incompl...@gmail.com wrote: On Mon, Aug 17, 2015 at 10:11 AM, Shivaram Venkataraman notificati...@github.com wrote: @JoshRosen https://github.com/JoshRosen There seems to be some problem on some of the Jenkins workers and we get errors which look like running git clean -fdx warning: failed to remove 'target/' Removing target/ Build step 'Execute shell' marked build as failure I've seen this in other PRs as well -- Any ideas what is causing this ? somehow the spark builds are creating directories w/the wrong permissions (missing the owner write bit), meaning that the directory created from a previous build can't be deleted and thereby fails the build. i'll go through all of the workers/spark build dirs first thing tomorrow and fix this.
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132343931 [Test build #41150 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41150/console) for PR 8281 at commit [`541d9a0`](https://github.com/apache/spark/commit/541d9a016b125a3fbbef5cdf97ee3ff9db78b8a0). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `implicit class StringToColumn(val sc: StringContext)`
[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132345128 The diff I'm proposing is something like

```
+    val numShuffleDeps = rdd.dependencies.filter(_.isInstanceOf[ShuffleDependency[_, _, _]]).length
+
     // If the RDD has shuffle dependencies and shuffle locality is enabled, pick locations that
     // have at least REDUCER_PREF_LOCS_FRACTION of data as preferred locations
-    if (shuffleLocalityEnabled && rdd.partitions.length < SHUFFLE_PREF_REDUCE_THRESHOLD) {
+    if (numShuffleDeps == 1 && shuffleLocalityEnabled &&
+        rdd.partitions.length < SHUFFLE_PREF_REDUCE_THRESHOLD) {
```
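[Editor's note] A stand-alone sketch of the dependency-counting idiom in the proposed diff, for readers outside the Spark codebase. `Dependency` and `ShuffleDependency` here are simplified placeholders, not the real `org.apache.spark` types, and the guard variable names are illustrative:

```scala
// Placeholder dependency hierarchy standing in for Spark's Dependency types.
abstract class Dependency
class NarrowDependency extends Dependency
class ShuffleDependency extends Dependency

val dependencies: Seq[Dependency] =
  Seq(new NarrowDependency, new ShuffleDependency, new NarrowDependency)

// Count the shuffle dependencies; the proposed guard applies the reduce-side
// locality preference only when there is exactly one.
val numShuffleDeps = dependencies.count(_.isInstanceOf[ShuffleDependency])
val useShuffleLocality = numShuffleDeps == 1
println(s"numShuffleDeps=$numShuffleDeps useShuffleLocality=$useShuffleLocality")
```

`count(_.isInstanceOf[...])` is equivalent to the diff's `filter(...).length` but avoids building the intermediate collection.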
[GitHub] spark pull request: [SPARK-9833] [yarn] Add options to disable del...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/8134#issuecomment-132347459 @tgravescs I chose a slightly different name than you suggested, how does that sound?
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8286#issuecomment-132349838 Merged build started.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37353128

```
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ExecutorLossReason.scala ---
@@ -23,13 +23,29 @@ import org.apache.spark.executor.ExecutorExitCode
  * Represents an explanation for an executor or whole slave failing or exiting.
  */
 private[spark]
-class ExecutorLossReason(val message: String) {
+class ExecutorLossReason(val message: String) extends Serializable {
   override def toString: String = message
 }

+private[spark] case class ExecutorExitedAbnormally(val exitCode: Int, reason: String)
+  extends ExecutorLossReason(reason) {
+}
+
+private[spark] object ExecutorExitedAbnormally {
+  def apply(exitCode: Int): ExecutorExitedAbnormally = {
+    ExecutorExitedAbnormally(exitCode, ExecutorExitCode.explainExitCode(exitCode))
+  }
+}
+
 private[spark]
-case class ExecutorExited(val exitCode: Int)
-  extends ExecutorLossReason(ExecutorExitCode.explainExitCode(exitCode)) {
+case class ExecutorExitedNormally(val exitCode: Int, reason: String)
+  extends ExecutorLossReason(reason) {
+}
+
+private[spark] object ExecutorExitedNormally {
```

--- End diff --

I don't know, I find `ExecutorExitedAbnormally` and `ExecutorExitedNormally` a little confusing, since internally they hold exactly the same data (even the same reason message). What if there was only `ExecutorExited` with a parameter saying whether it should be treated as an error or not?
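[Editor's note] A hedged sketch of the single-class alternative the reviewer suggests: one `ExecutorExited` case class carrying a normal-exit flag instead of two near-identical variants. The flag name and the `explainExitCode` stand-in are illustrative, not necessarily what was merged into Spark:

```scala
class ExecutorLossReason(val message: String) extends Serializable {
  override def toString: String = message
}

// One class for both cases; the flag says whether the exit counts as an error.
case class ExecutorExited(exitCode: Int, isNormalExit: Boolean, reason: String)
  extends ExecutorLossReason(reason)

object ExecutorExited {
  // Placeholder for Spark's ExecutorExitCode.explainExitCode(exitCode).
  private def explainExitCode(code: Int): String = s"executor exited with code $code"

  def apply(exitCode: Int, isNormalExit: Boolean): ExecutorExited =
    ExecutorExited(exitCode, isNormalExit, explainExitCode(exitCode))
}

val preempted = ExecutorExited(137, isNormalExit = true)
println(preempted)
```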
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8286#issuecomment-132349770 Merged build triggered.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37353932

```
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -809,9 +825,14 @@ private[spark] class TaskSetManager(
       }
     }
   }
-    // Also re-enqueue any tasks that were running on the node
     for ((tid, info) <- taskInfos if info.running && info.executorId == execId) {
-      handleFailedTask(tid, TaskState.FAILED, ExecutorLostFailure(execId))
+    // Also re-enqueue any tasks that were running on the node
+    val executorFailureReason = reason match {
+      case exited: ExecutorExitedNormally =>
```

--- End diff --

This would go away if you follow my suggestion of merging the two errors, but in any case, since it's a case class:

```
case ExecutorExitedNormally(exitCode, reason) =>
```
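[Editor's note] A minimal illustration of the destructuring pattern the reviewer suggests: a case class pattern binds the fields directly, avoiding the typed binder plus field access. The types here are simplified stand-ins for the Spark classes under review:

```scala
case class ExecutorExitedNormally(exitCode: Int, reason: String)

def describe(loss: Any): String = loss match {
  // Instead of `case exited: ExecutorExitedNormally => ... exited.reason`,
  // bind exitCode and reason directly in the pattern.
  case ExecutorExitedNormally(exitCode, reason) =>
    s"normal exit ($exitCode): $reason"
  case other =>
    s"unexpected loss reason: $other"
}

println(describe(ExecutorExitedNormally(0, "preempted by YARN")))
```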
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37355106

```
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
@@ -91,6 +92,68 @@ private[spark] abstract class YarnSchedulerBackend(
   }

   /**
+   * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected.
+   * We should check the cluster manager and find if the loss of the executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)])
+      extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+    private val pendingDisconnectedExecutors = new HashSet[String]
+    private val handleDisconnectedExecutorThreadPool =
+      ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+    /**
+     * When onDisconnected is received at the driver endpoint, the superclass DriverEndpoint
+     * handles it by assuming the Executor was lost for a bad reason and removes the executor
+     * immediately.
+     *
+     * In YARN's case however it is crucial to talk to the application master and ask why the
+     * executor had exited. In particular, the executor may have exited due to the executor
+     * having been preempted. If the executor exited normally according to the application
+     * master then we pass that information down to the TaskSetManager to inform the
+     * TaskSetManager that tasks on that lost executor should not count towards a job failure.
+     */
+    override def onDisconnected(rpcAddress: RpcAddress): Unit = {
+      addressToExecutorId.get(rpcAddress).foreach({ executorId =>
+        // onDisconnected could be fired multiple times from the same executor while we're
+        // asynchronously contacting the AM. So keep track of the executors we're trying to
+        // find loss reasons for and don't duplicate the work
+        if (!pendingDisconnectedExecutors.contains(executorId)) {
+          pendingDisconnectedExecutors.add(executorId)
+          handleDisconnectedExecutorThreadPool.submit(new Runnable() {
+            override def run(): Unit = {
+              val executorLossReason =
+                // Check for the loss reason and pass the loss reason to driverEndpoint
+                yarnSchedulerEndpoint.askWithRetry[Option[ExecutorLossReason]](
+                  GetExecutorLossReason(executorId))
+              executorLossReason match {
+                case Some(reason) =>
+                  driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, reason))
+                case None =>
+                  logWarning(s"Attempted to get executor loss reason" +
+                    s" for $rpcAddress but got no response. Marking as slave lost.")
+                  driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, SlaveLost()))
```

--- End diff --

Can you call `super.removeExecutor()` directly here, instead of doing the round-trip through the RPC layer? (Might need to check whether that method is thread-safe.)
[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8257#issuecomment-132355177 [Test build #1651 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1651/console) for PR 8257 at commit [`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10088] [sql] Add support for stored as...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8282#issuecomment-132355127 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10088] [sql] Add support for stored as...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8282#issuecomment-132355128 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41153/ Test PASSed.
[GitHub] spark pull request: [SPARK-9782] [YARN] Support YARN application t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8072
[GitHub] spark pull request: [SPARK-10089] [sql] Add missing golden files.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8283
[GitHub] spark pull request: [SPARK-10088] [sql] Add support for stored as...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/8282#issuecomment-132364808 Thanks! Merging to master and 1.5.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37357691

```
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
@@ -91,6 +92,66 @@ private[spark] abstract class YarnSchedulerBackend(
   }

   /**
+   * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected.
+   * We should check the cluster manager and find if the loss of the executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)])
+      extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+    private val pendingDisconnectedExecutors = new HashSet[String]
+    private val handleDisconnectedExecutorThreadPool =
+      ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+    /**
+     * When onDisconnected is received at the driver endpoint, the superclass DriverEndpoint
+     * handles it by assuming the Executor was lost for a bad reason and removes the executor
+     * immediately.
+     *
+     * In YARN's case however it is crucial to talk to the application master and ask why the
+     * executor had exited. In particular, the executor may have exited due to the executor
+     * having been preempted. If the executor exited normally according to the application
+     * master then we pass that information down to the TaskSetManager to inform the
+     * TaskSetManager that tasks on that lost executor should not count towards a job failure.
+     */
+    override def onDisconnected(rpcAddress: RpcAddress): Unit = {
```

--- End diff --

I guess there will be wasted work in the sense that the tasks will get allocated to the bad executor, then the executor will be removed, and all of those tasks are relocated to the healthy ones. That's probably fine from a correctness standpoint but might add a bit of latency. I'm open to the discussion of doing another architecture overhaul to get the soft-unregistration construct done. The other thing I'm wondering is whether it's even worth offloading this communicate-with-AM logic to be asynchronous at all. How big a performance penalty would it be to block the event loop with the request to the AM for the executor loss reason? I presumed it was unacceptable to do that blocking request on the main event loop, though.
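[Editor's note] A hedged sketch of the async hand-off being debated here: the blocking "ask the AM" call is pushed onto a thread pool so the RPC event loop is not blocked, with a concurrent set preventing duplicate in-flight lookups per executor. `java.util.concurrent` stand-ins replace Spark's `ThreadUtils`/`askWithRetry` internals, and `fetchLossReason` is a placeholder:

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

val pendingDisconnectedExecutors = ConcurrentHashMap.newKeySet[String]()
val pool = Executors.newCachedThreadPool()

// Stands in for the blocking round-trip to the application master.
def fetchLossReason(executorId: String): String = s"loss reason for $executorId"

def onDisconnected(executorId: String): Unit = {
  // add() returns false when the id is already pending, so repeated
  // onDisconnected events for one executor trigger only one lookup.
  if (pendingDisconnectedExecutors.add(executorId)) {
    pool.submit(new Runnable {
      override def run(): Unit =
        try println(fetchLossReason(executorId))
        finally pendingDisconnectedExecutors.remove(executorId)
    })
  }
}

onDisconnected("executor-1")
onDisconnected("executor-1") // likely dropped while the first lookup is pending
pool.shutdown()
pool.awaitTermination(5, TimeUnit.SECONDS)
```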
[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8257#issuecomment-132368265 [Test build #1653 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1653/consoleFull) for PR 8257 at commit [`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206).
[GitHub] spark pull request: [SPARK-5754] [yarn] Spark/Yarn/Windows driver/...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8053#issuecomment-132368081 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9893] user guide for VectorSlicer
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/8267#issuecomment-132333738 @yinxusen Sorry, I think there's some merge conflicts. Do you mind rebasing master?
[GitHub] spark pull request: [SPARK-9439] [yarn] External shuffle service r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7943#issuecomment-132333839 Merged build triggered.
[GitHub] spark pull request: [SPARK-9893] user guide for VectorSlicer
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8267#discussion_r37346725

```
--- Diff: docs/ml-features.md ---
@@ -1389,3 +1389,145 @@ print(output.select("features", "clicked").first())

 # Feature Selectors

+## VectorSlicer
+
+`VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
+
+`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column whose values are selected via those indices. There are two types of indices,
+
+ 1. Integer indices that represents the real indices in the vector, `setIndices()`;
```

--- End diff --

I would remove the word "real" (i.e. "...that represent the indices into the vector") since it could be confused for real numbers (i.e. real-valued indices, which don't really make sense)
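[Editor's note] A conceptual sketch of what the slicer described in this doc change does: select a sub-array of features by integer indices. Plain arrays stand in for the ML vector types, and the real `VectorSlicer` also accepts feature names resolved through column metadata via `setNames()`:

```scala
val features = Array(0.0, 10.0, 0.5, 3.2)
val indices = Array(1, 3) // analogous to setIndices(Array(1, 3))

// Select the features at the given positions, preserving index order.
val sliced = indices.map(features(_))
println(sliced.mkString("[", ", ", "]"))
```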
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132336754 LGTM
[GitHub] spark pull request: [SPARK-10001] [CORE] Allow Ctrl-C in spark-she...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8216#issuecomment-132339598 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41148/ Test PASSed.
[GitHub] spark pull request: [SPARK-10001] [CORE] Allow Ctrl-C in spark-she...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8216#issuecomment-132339594 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/8197#discussion_r37347701

```
--- Diff: docs/ml-linear-methods.md ---
@@ -118,12 +133,114 @@ lrModel = lr.fit(training)
 print("Weights: " + str(lrModel.weights))
 print("Intercept: " + str(lrModel.intercept))
 {% endhighlight %}
+</div>
 </div>

+The `spark.ml` implementation of logistic regression also supports
+extracting a summary of the model over the training set. Note that the
+predictions and metrics which are stored as `DataFrame`s in
+`BinaryLogisticRegressionSummary` are annotated `@transient` and hence
+only available on the driver.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
+provides a summary for a
+[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
+Currently, only binary classification is supported and the
+summary must be explicitly cast to
+[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
+This will likely change when multiclass classification is supported.
+
+Continuing the earlier example:
+
+{% highlight scala %}
+// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example
+val trainingSummary = lrModel.summary
+
+// Obtain the loss per iteration.
+val objectiveHistory = trainingSummary.objectiveHistory
+objectiveHistory.foreach(loss => println(loss))
+
+// Obtain the metrics useful to judge performance on test data.
+// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a
+// binary classification problem.
+val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]
+
+// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
+val roc = binarySummary.roc
+roc.show()
+roc.select("FPR").show()
+println(binarySummary.areaUnderROC)
+
+// Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression with
+// this selected threshold.
+val fMeasure = binarySummary.fMeasureByThreshold
+val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
+val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure).
+  select("threshold").head().getDouble(0)
+logReg.setThreshold(bestThreshold)
+logReg.fit(logRegDataFrame)
+{% endhighlight %}
 </div>

-### Optimization
+<div data-lang="java" markdown="1">
+[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
+provides a summary for a
+[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
+Currently, only binary classification is supported and the
+summary must be explicitly cast to
+[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
+This will likely change when multiclass classification is supported.
+
+Continuing the earlier example:
+
+{% highlight java %}
+// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example
+LogisticRegressionTrainingSummary trainingSummary = logRegModel.summary();
+
+// Obtain the loss per iteration.
+double[] objectiveHistory = trainingSummary.objectiveHistory();
+for (double lossPerIteration : objectiveHistory) {
```

--- End diff --

Perhaps... the [Spark style guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) only covers Scala, so I was just going by past experience on this change. We could try to get a Java style guide in if there's a community need for it.
[GitHub] spark pull request: [SPARK-10082] [MLlib] Validate i, j in apply D...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8271#issuecomment-132339896 Merged build triggered.
[GitHub] spark pull request: [SPARK-10093][SQL] Avoid transformation on exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8285#issuecomment-132340973 Merged build started.
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132344114 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8197#issuecomment-132344168 [Test build #41162 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41162/consoleFull) for PR 8197 at commit [`7bf922c`](https://github.com/apache/spark/commit/7bf922c53b0e7f6e6d5304107f432b58ad7b93c7).
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/8281#issuecomment-132347116 Merged to master and 1.5.
[GitHub] spark pull request: [SPARK-10080][SQL] Fix binary incompatibility ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8281
[GitHub] spark pull request: [SPARK-10095] [SQL] use public API of BigInteg...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/8286 [SPARK-10095] [SQL] use public API of BigInteger

In UnsafeRow, we use the private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and it makes the code non-portable (it may fail on other JVM implementations). So we should use the public API instead. cc @rxin

You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark portable_decimal Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8286.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8286

commit e547fe80f59f83fe2b3934215975f9180c5da164 Author: Davies Liu dav...@databricks.com Date: 2015-08-18T20:59:58Z use public API of BigInteger
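For context, the portable route the PR description points at amounts to round-tripping a `BigInteger` through its public byte representation instead of reaching into its internal magnitude array. A sketch of that general pattern (the class and method names here are illustrative, not the PR's actual code):

```java
import java.math.BigInteger;

public class PortableDecimal {
    // Serialize a BigInteger using only public API: its sign plus the
    // big-endian bytes of its magnitude.
    static byte[] magnitudeBytes(BigInteger v) {
        return v.abs().toByteArray();   // public and portable across JVMs
    }

    // Reconstruct via the public (int signum, byte[] magnitude) constructor.
    static BigInteger restore(int signum, byte[] magnitude) {
        return new BigInteger(signum, magnitude);
    }

    public static void main(String[] args) {
        BigInteger original = new BigInteger("-123456789012345678901234567890");
        BigInteger roundTripped =
            restore(original.signum(), magnitudeBytes(original));
        System.out.println(roundTripped.equals(original));  // prints true
    }
}
```

Unlike reading `BigInteger`'s private `int[] mag` field via reflection or `Unsafe`, every call above is part of the documented `java.math` API, so it behaves identically on any compliant JVM.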
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37354476

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---

@@ -91,6 +92,51 @@ private[spark] abstract class YarnSchedulerBackend(
   }

   /**
+   * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected.
+   * We should check the cluster manager and find if the loss of the executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)])

--- End diff --

Wait, I'm confused. `YarnSchedulerEndpoint` and `DriverEndpoint` are different things; `YarnSchedulerEndpoint` is a communication channel between the driver in YARN mode and the YARN AM, and there are no executors involved. Why wouldn't that work here?
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37355322

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---

@@ -91,6 +92,68 @@ private[spark] abstract class YarnSchedulerBackend(
   }

   /**
+   * Override the DriverEndpoint to add extra logic for the case when an executor is disconnected.
+   * We should check the cluster manager and find if the loss of the executor was caused by YARN
+   * force killing it due to preemption.
+   */
+  private class YarnDriverEndpoint(rpcEnv: RpcEnv, sparkProperties: ArrayBuffer[(String, String)])
+    extends DriverEndpoint(rpcEnv, sparkProperties) {
+
+    private val pendingDisconnectedExecutors = new HashSet[String]
+    private val handleDisconnectedExecutorThreadPool =
+      ThreadUtils.newDaemonCachedThreadPool("yarn-driver-handle-lost-executor-thread-pool")
+
+    /**
+     * When onDisconnected is received at the driver endpoint, the superclass DriverEndpoint
+     * handles it by assuming the Executor was lost for a bad reason and removes the executor
+     * immediately.
+     *
+     * In YARN's case however it is crucial to talk to the application master and ask why the
+     * executor had exited. In particular, the executor may have exited due to the executor
+     * having been preempted. If the executor exited normally according to the application
+     * master then we pass that information down to the TaskSetManager to inform the
+     * TaskSetManager that tasks on that lost executor should not count towards a job failure.
+     */
+    override def onDisconnected(rpcAddress: RpcAddress): Unit = {
+      addressToExecutorId.get(rpcAddress).foreach { executorId =>
+        // onDisconnected could be fired multiple times from the same executor while we're
+        // asynchronously contacting the AM. So keep track of the executors we're trying to
+        // find loss reasons for and don't duplicate the work
+        if (!pendingDisconnectedExecutors.contains(executorId)) {
+          pendingDisconnectedExecutors.add(executorId)
+          handleDisconnectedExecutorThreadPool.submit(new Runnable() {
+            override def run(): Unit = {
+              val executorLossReason =
+                // Check for the loss reason and pass the loss reason to driverEndpoint
+                yarnSchedulerEndpoint.askWithRetry[Option[ExecutorLossReason]](
+                  GetExecutorLossReason(executorId))
+              executorLossReason match {
+                case Some(reason) =>
+                  driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, reason))
+                case None =>
+                  logWarning(s"Attempted to get executor loss reason " +
+                    s"for $rpcAddress but got no response. Marking as slave lost.")
+                  driverEndpoint.askWithRetry[Boolean](RemoveExecutor(executorId, SlaveLost()))

--- End diff --

Definitely don't think that's thread safe. It touches things like addressToExecutorId, which, as we can see in the onDisconnected method itself, is accessed in the event loop.
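The deduplication idea in this diff — remember which executors already have an in-flight loss-reason lookup and skip repeat disconnect events — can be shown in isolation with plain JDK concurrency primitives. This is a standalone sketch under that reading of the diff; the names are illustrative, not the PR's classes:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class DisconnectDedup {
    // Executors with a loss-reason lookup currently in flight.
    final Set<String> pending = ConcurrentHashMap.newKeySet();
    final ExecutorService pool = Executors.newCachedThreadPool();

    // Fire the expensive lookup at most once per executor while one is in
    // flight; Set.add is atomic, so the check-then-act cannot race.
    void onDisconnected(String executorId, Consumer<String> lookup) {
        if (pending.add(executorId)) {
            pool.submit(() -> {
                try {
                    lookup.accept(executorId);   // the slow "ask the AM" call
                } finally {
                    pending.remove(executorId);
                }
            });
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DisconnectDedup dedup = new DisconnectDedup();
        AtomicInteger calls = new AtomicInteger();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // First event starts a lookup and blocks inside it.
        dedup.onDisconnected("exec-1", id -> {
            calls.incrementAndGet();
            started.countDown();
            try { release.await(); } catch (InterruptedException ignored) { }
        });
        started.await();

        // Second event for the same executor is deduplicated while in flight.
        dedup.onDisconnected("exec-1", id -> calls.incrementAndGet());

        release.countDown();
        dedup.pool.shutdown();
        dedup.pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(calls.get());  // prints 1
    }
}
```

Note this sketch only avoids duplicate lookups; it does not address the separate thread-safety concern raised above about touching driver-side state such as addressToExecutorId from a pool thread.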
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/8007#discussion_r37356390

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---

@@ -207,6 +211,17 @@ private[yarn] class YarnAllocator(
   }

   /**
+   * Gets the executor loss reason for a disconnected executor.
+   * Note that this method is expected to be called exactly once per executor ID.
+   */
+  def getExecutorLossReason(executorId: String): ExecutorLossReason = synchronized {
+    allocateResources()
+    // Expect to be asked for a loss reason once and exactly once.
+    assert(completedExecutorExitReasons.contains(executorId))

--- End diff --

That's up to the YARN daemons - I'm going off of the assumption that the AMRM client will always report the most up-to-date status about containers. If this isn't necessarily true then we should revisit this.
[GitHub] spark pull request: [SPARK-8167] Make tasks that fail from YARN pr...
Github user markgrover commented on the pull request: https://github.com/apache/spark/pull/8007#issuecomment-132367263

> Actually @markgrover can you describe in more detail how you were trying to use GetExecutorLossReason?

So, I uploaded my [YARN](https://gist.github.com/markgrover/bb816fe8556e9498871a) and [driver](https://gist.github.com/markgrover/3c4ef4e3a7823864bbd8) logs. See [here](https://gist.github.com/markgrover/bb816fe8556e9498871a#file-yarn-log-L92) for how the executor loss reason is being asked for twice. I verified that it was only requested once by looking at the driver log. You can do so too by searching for *Requesting loss reason for executorId: 2* in the [full driver log](https://gist.githubusercontent.com/markgrover/3c4ef4e3a7823864bbd8/raw/ec8793874b8ff0545fd61f6bdc6e7f0681f9de1c/driver.log). The relevant event-receiving code snippet is [here](https://github.com/markgrover/spark/blob/5a90c9926a396cf6b0b68ee3fabbfc67ae07dcf7/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L213) and the relevant event-sending code snippet is [here](https://github.com/markgrover/spark/blob/5a90c9926a396cf6b0b68ee3fabbfc67ae07dcf7/core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L119). This is all before merging your latest change, since I was trying this out last Friday. I wasn't able to figure out then why two events were being received.

Anyways, I realize this is a lot of info and perhaps hard to go through. I will poke at it more, perhaps merge your latest commit and see if that helps. I am envisioning some pain given that your and my pull requests share some code. If it continues for much longer, maybe we should work off the same branch. I personally am indifferent to whether I rebase on yours, or vice versa. I was also hoping to find a Spark IRC channel where we could collaborate in real time but couldn't. I would definitely be open to hacking on this in a more collaborative way if you think it'd help (I think it would).
[GitHub] spark pull request: [SPARK-10072][STREAMING] BlockGenerator can de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8257#issuecomment-132369841 [Test build #1652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1652/console) for PR 8257 at commit [`cb7bba2`](https://github.com/apache/spark/commit/cb7bba2f3ba1f3af87a55f2fc4f38da142099206). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10004] [shuffle] Perform auth checks wh...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8218#issuecomment-132369820 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41156/ Test FAILed.
[GitHub] spark pull request: [SPARK-8918] [MLLIB] [DOC] Add @since tags to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8288#issuecomment-132372147 [Test build #41166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41166/consoleFull) for PR 8288 at commit [`72fdeb6`](https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659).
[GitHub] spark pull request: [SPARK-10017] [MLlib]: ML model broadcasts sho...
Github user sabhyankar commented on the pull request: https://github.com/apache/spark/pull/8241#issuecomment-132374365 @holdenk Not sure if you are reviewing the other PRs, but the fix should now be in all of them. Thx!
[GitHub] spark pull request: [SPARK-10098][STREAMING][TEST] Cleanup active ...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/8289 [SPARK-10098][STREAMING][TEST] Cleanup active context after test in FailureSuite

Failures in streaming.FailureSuite can leak StreamingContext and SparkContext, which fails all subsequent tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/tdas/spark SPARK-10098 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8289.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8289

commit 3545d1efb7c46a23dbb13dae0d65a5c8d3e9aca6 Author: Tathagata Das tathagata.das1...@gmail.com Date: 2015-08-18T22:22:07Z Cleanup active contexts after test
[GitHub] spark pull request: [SPARK-8473][SPARK-9889][ML] User guide and ex...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/8184#discussion_r37361712

--- Diff: docs/ml-features.md ---

@@ -649,6 +649,77 @@ for expanded in polyDF.select("polyFeatures").take(3):
 </div>
 </div>

+## Discrete Cosine Transform (DCT)
+
+The [Discrete Cosine
+Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform)
+transforms a length $N$ real-valued sequence in the time domain into
+another length $N$ real-valued sequence in the frequency domain. A
+[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class
+provides this functionality, implementing the
+[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II)
+and scaling the result by $1/\sqrt{2}$ such that the representing matrix
+for the transform is unitary. No shift is applied to the transformed
+sequence (e.g. the $0$th element of the transformed sequence is the
+$0$th DCT coefficient and _not_ the $N/2$th).
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.DCT
+import org.apache.spark.mllib.linalg.Vectors
+
+val data = Seq(
+  Vectors.dense(0.0, 1.0, -2.0, 3.0),
+  Vectors.dense(-1.0, 2.0, 4.0, -7.0),
+  Vectors.dense(14.0, -2.0, -5.0, 1.0))
+val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
+val dct = new DCT()
+  .setInputCol("features")
+  .setOutputCol("featuresDCT")
+  .setInverse(false)
+val dctDf = dct.transform(df)
+dctDf.select("featuresDCT").show(3)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.ml.feature.DCT;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.VectorUDT;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+JavaRDD<Row> data = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)),
+  RowFactory.create(Vectors.dense(-1.0, 2.0, 4.0, -7.0)),
+  RowFactory.create(Vectors.dense(14.0, -2.0, -5.0, 1.0))
+));
+StructType schema = new StructType(new StructField[] {
+  new StructField("features", new VectorUDT(), false, Metadata.empty()),
+});
+DataFrame df = jsql.createDataFrame(data, schema);
+DCT dct = new DCT()
+  .setInputCol("features")
+  .setOutputCol("featuresDCT")
+  .setInverse(false);
+DataFrame dctDf = dct.transform(df);
+dctDf.select("featuresDCT").take(3).show(3);

--- End diff --

Remove take(3)
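The unitarity property the quoted doc text describes (DCT-II with the extra $1/\sqrt{2}$ scaling on the $0$th coefficient) is easy to check numerically. Below is a plain-Java sketch of the orthonormal DCT-II, written from the standard DCT-II formula and independent of the `ml.feature.DCT` implementation; a unitary transform must preserve the Euclidean norm of its input:

```java
public class Dct2 {
    // Orthonormal DCT-II: X_k = s_k * sum_i x_i * cos(pi/N * (i + 1/2) * k),
    // with s_0 = sqrt(1/N) and s_k = sqrt(2/N) for k > 0. These scale factors
    // make the transform matrix unitary.
    static double[] dct2(double[] x) {
        int n = x.length;
        double[] out = new double[n];
        for (int k = 0; k < n; k++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += x[i] * Math.cos(Math.PI / n * (i + 0.5) * k);
            }
            double scale = (k == 0) ? Math.sqrt(1.0 / n) : Math.sqrt(2.0 / n);
            out[k] = scale * sum;
        }
        return out;
    }

    static double norm(double[] v) {
        double s = 0.0;
        for (double e : v) s += e * e;
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[] x = {0.0, 1.0, -2.0, 3.0};   // same vector as the docs example
        double[] y = dct2(x);
        // Unitarity check: input and output norms agree up to rounding.
        System.out.println(Math.abs(norm(x) - norm(y)) < 1e-9);  // prints true
    }
}
```

This also illustrates the "no shift" remark in the doc text: `dct2(x)[0]` is the (scaled) sum of the inputs, i.e. the $0$th DCT coefficient sits at index $0$.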
[GitHub] spark pull request: [SPARK-10012][ML] Missing test case for Params...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8223#issuecomment-132375517 No, just trying not to merge stuff which isn't critical, but sure, I'll merge it with master and branch-1.5.