date:20171103

[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19647
  
**[Test build #83432 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83432/testReport)**
 for PR 19647 at commit 
[`0e0123b`](https://github.com/apache/spark/commit/0e0123b0887aa33c382a63ee7843be5f64d38b41).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown

2017-11-03 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19647
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19650: [SPARK-22254][core] Fix the arrayMax in BufferHol...

2017-11-03 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19650


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19650: [SPARK-22254][core] Fix the arrayMax in BufferHolder

2017-11-03 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19650
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19651
  
**[Test build #83431 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83431/testReport)**
 for PR 19651 at commit 
[`fdde274`](https://github.com/apache/spark/commit/fdde27416fe54036afbd9b809a363e7871df67cf).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19651
  
Retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL ...

2017-11-03 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19653


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19653
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19647
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83430/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19647
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown

2017-11-03 Thread henryr

Github user henryr commented on the issue:

https://github.com/apache/spark/pull/19647
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19468
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19468
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83425/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19468
  
**[Test build #83425 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83425/testReport)**
 for PR 19468 at commit 
[`6cf4ed7`](https://github.com/apache/spark/commit/6cf4ed7eec3f8a1798d260622ab5641db92ab13d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19582
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83424/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19582
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19582
  
**[Test build #83424 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83424/testReport)**
 for PR 19582 at commit 
[`537c7b4`](https://github.com/apache/spark/commit/537c7b4bf27e1392fd4868db688f7d2f48baa3a5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19656
  
A rough glance looks good. Will review carefully after solving the build.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

2017-11-03 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19642
  
About why `ExtractPythonUDFs` applies on physical plans, I think it may 
partly because the wrapper of Python function `PythonFunction` also 
encapsulates core concepts like `Broadcast` and `Accumulator`, it is out of the 
scope of catalyst APIs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...

2017-11-03 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19642
  
Another concern is, by seeing python udfs as normal expressions without 
specific operator, we can apply necessary optimization such as CollapseProject. 
If we extract python udfs earlier in logical plan, we might have multiple 
python runners that were collapsed originally now.

To deal with it, we may need to add specific optimization rule for 
collapsing python runners.

This adds more complexity, IMHO. We should consider if it's worth.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19656
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83429/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19656
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19656
  
**[Test build #83429 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83429/testReport)**
 for PR 19656 at commit 
[`94beff0`](https://github.com/apache/spark/commit/94beff0e229da4c5dd3cb3ddb677e5fe6e42499c).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19656
  
**[Test build #83429 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83429/testReport)**
 for PR 19656 at commit 
[`94beff0`](https://github.com/apache/spark/commit/94beff0e229da4c5dd3cb3ddb677e5fe6e42499c).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19656
  
Jenkins, retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

2017-11-03 Thread tengpeng

Github user tengpeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19638#discussion_r148919520
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala 
---
@@ -764,13 +764,17 @@ class LinearRegressionSuite
   (Intercept) 6.3022157  0.00186003388   <2e-16 ***
   V2  4.6982442  0.00118053980   <2e-16 ***
   V3  7.1994344  0.00090447961   <2e-16 ***
+
+  # R code for r2adj
+  summary(lm(X1 ~ ., data = part_0))
--- End diff --

The existing test "Linear regression model training summary" does not test 
without intercept case. If there is a "Linear regression model training summary 
wo intercept", I am glad to add in it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19623
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83428/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19623
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83427/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19651
  
Retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

2017-11-03 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r148918617
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java ---
@@ -30,6 +30,9 @@
   /**
* Creates a {@link DataSourceV2Reader} to scan the data from this data 
source.
*
+   * If this method fails (by throwing an exception), the action would 
fail and no Spark job was
--- End diff --

no exception is recommended here, you can throw any exception and Spark 
will fail your job. This document just describes what will happen if an 
exception is thrown, but no exception is expected actually.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

2017-11-03 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r148918542
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -50,28 +53,34 @@
 
   /**
* Creates a writer factory which will be serialized and sent to 
executors.
+   *
+   * If this method fails (by throwing an exception), the action would 
fail and no Spark job was
+   * submitted.
*/
   DataWriterFactory createWriterFactory();
 
   /**
* Commits this writing job with a list of commit messages. The commit 
messages are collected from
-   * successful data writers and are produced by {@link 
DataWriter#commit()}. If this method
-   * fails(throw exception), this writing job is considered to be failed, 
and
-   * {@link #abort(WriterCommitMessage[])} will be called. The written 
data should only be visible
-   * to data source readers if this method succeeds.
+   * successful data writers and are produced by {@link 
DataWriter#commit()}.
+   *
+   * If this method fails (by throwing an exception), this writing job is 
considered to to have been
+   * failed, and {@link #abort(WriterCommitMessage[])} would be called. 
The state of the destination
+   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be 
able to deal with it.
*
* Note that, one partition may have multiple committed data writers 
because of speculative tasks.
* Spark will pick the first successful one and get its commit message. 
Implementations should be
--- End diff --

it's really out of the scope of this PR...

This PR is adding documents to describe exception behavior, please send a 
new PR to update the document for the commit stuff and we can move our 
discussion there. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19653
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83421/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19653
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19653
  
**[Test build #83421 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83421/testReport)**
 for PR 19653 at commit 
[`92308a4`](https://github.com/apache/spark/commit/92308a4341849258caf549d1bcbeabd9002d3ead).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19583: [WIP][SPARK-22339] [CORE] [NETWORK-SHUFFLE] Push ...

2017-11-03 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19583#discussion_r148917664
  
--- Diff: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ---
@@ -51,7 +51,26 @@ private case class ExecutorRegistered(executorId: String)
 
 private case class ExecutorRemoved(executorId: String)
 
-private[spark] case class HeartbeatResponse(reregisterBlockManager: 
Boolean)
+private[spark] case class UpdatedEpoch(epoch: Long)
+
+private[spark] object HeartbeatResponse {
+  def apply(reregisterBlockManager: Boolean,
+updatedEpoch: Option[Long] = None): HeartbeatResponse =
+
updatedEpoch.fold[HeartbeatResponse](BasicHeartbeatResponse(reregisterBlockManager))
 {
+  epoch => HeartbeatResponseWithEpoch(reregisterBlockManager, 
Some(epoch))
+}
+}
+
+private[spark] sealed trait HeartbeatResponse {
+  def reregisterBlockManager: Boolean
+  def updatedEpoch: Option[Long] = None
+}
+
+private[spark] case class BasicHeartbeatResponse(reregisterBlockManager: 
Boolean)
+  extends HeartbeatResponse
+private[spark] case class 
HeartbeatResponseWithEpoch(reregisterBlockManager: Boolean,
+ override val 
updatedEpoch: Option[Long])
+  extends HeartbeatResponse
--- End diff --

seems not a big saving, I think we can even always include the epoch in the 
heartbeat response.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19656
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83426/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19656
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19656: [SPARK-22445][SQL] move CodegenContext.copyResult...

2017-11-03 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19656#discussion_r148917096
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -213,19 +213,32 @@ trait CodegenSupport extends SparkPlan {
   }
 
   /**
-   * For optimization to suppress shouldStop() in a loop of 
WholeStageCodegen.
-   * Returning true means we need to insert shouldStop() into the loop 
producing rows, if any.
+   * Whether or not the result rows of this operator should be copied 
before putting into a buffer.
+   *
+   * If any operator inside WholeStageCodegen generate multiple rows from 
a single row (for
+   * example, Join), this should be true.
+   *
+   * If an operator starts a new pipeline, this should be false.
*/
-  def isShouldStopRequired: Boolean = {
-return shouldStopRequired && (this.parent == null || 
this.parent.isShouldStopRequired)
+  protected def needCopyResult: Boolean = {
+if (children.isEmpty) {
+  false
+} else if (children.length == 1) {
+  children.head.asInstanceOf[CodegenSupport].needStopCheck
+} else {
+  throw new UnsupportedOperationException
+}
   }
 
   /**
-   * Set to false if this plan consumes all rows produced by children but 
doesn't output row
-   * to buffer by calling append(), so the children don't require 
shouldStop()
-   * in the loop of producing rows.
+   * Whether or not the children of this operator should generate a stop 
check when consuming input
+   * rows. This is used to suppress shouldStop() in a loop of 
WholeStageCodegen.
+   *
+   * This should be false if an operator starts a new pipeline, which 
means it consumes all rows
+   * produced by children but doesn't output row to buffer by calling 
append(),  so the children
+   * don't require shouldStop() in the loop of producing rows.
*/
-  protected def shouldStopRequired: Boolean = true
+  protected def needStopCheck: Boolean = parent.needStopCheck
--- End diff --

it's unrelated but a simple clean up: we only need one method instead of 2.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...

2017-11-03 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19656
  
@juliuszsompolski @rednaxelafx @kiszk @viirya @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19656: [SPARK-22445][SQL] move CodegenContext.copyResult...

2017-11-03 Thread cloud-fan

GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/19656

[SPARK-22445][SQL] move CodegenContext.copyResult to CodegenSupport

## What changes were proposed in this pull request?

`CodegenContext.copyResult` is kind of a global status for whole stage 
codegen. But the tricky part is, it is only used to transfer an information 
from child to parent when calling the `consume` chain. We have to be super 
careful in `produce`/`consume`, to set it to true when producing multiple 
result rows, and set it to false in operators that start new pipeline(like 
sort).

This PR moves the `copyResult` to `CodegenSupport`, and call it at 
`WholeStageCodegenExec`. This is much easier to reason about.

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark whole-sage

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19656.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19656


commit 94beff0e229da4c5dd3cb3ddb677e5fe6e42499c
Author: Wenchen Fan 
Date:   2017-11-04T00:14:22Z

move CodegenContext.copyResult to CodegenSupport




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19468
  
**[Test build #83425 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83425/testReport)**
 for PR 19468 at commit 
[`6cf4ed7`](https://github.com/apache/spark/commit/6cf4ed7eec3f8a1798d260622ab5641db92ab13d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...

2017-11-03 Thread mccheah

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19468#discussion_r148916581
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/SchedulerBackendUtils.scala
 ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.scheduler.cluster
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.config.{DYN_ALLOCATION_MAX_EXECUTORS, 
DYN_ALLOCATION_MIN_EXECUTORS, EXECUTOR_INSTANCES}
+import org.apache.spark.util.Utils
+
+private[spark] object SchedulerBackendUtils {
+  val DEFAULT_NUMBER_EXECUTORS = 2
--- End diff --

Done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...

2017-11-03 Thread mccheah

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19468#discussion_r148916566
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala
 ---
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.scheduler.cluster.k8s
+
+import scala.collection.JavaConverters._
+
+import io.fabric8.kubernetes.api.model._
+
+import org.apache.spark.{SparkConf, SparkException}
+import org.apache.spark.deploy.k8s.ConfigurationUtils
+import org.apache.spark.deploy.k8s.config._
+import org.apache.spark.deploy.k8s.constants._
+import org.apache.spark.util.Utils
+
+/**
+ * Configures executor pods. Construct one of these with a SparkConf to 
set up properties that are
+ * common across all executors. Then, pass in dynamic parameters into 
createExecutorPod.
+ */
+private[spark] trait ExecutorPodFactory {
+  def createExecutorPod(
+  executorId: String,
+  applicationId: String,
+  driverUrl: String,
+  executorEnvs: Seq[(String, String)],
+  driverPod: Pod,
+  nodeToLocalTaskCount: Map[String, Int]): Pod
+}
+
+private[spark] class ExecutorPodFactoryImpl(sparkConf: SparkConf)
+  extends ExecutorPodFactory {
+
+  import ExecutorPodFactoryImpl._
+
+  private val executorExtraClasspath = sparkConf.get(
+org.apache.spark.internal.config.EXECUTOR_CLASS_PATH)
+
+  private val executorLabels = 
ConfigurationUtils.parsePrefixedKeyValuePairs(
+sparkConf,
+KUBERNETES_EXECUTOR_LABEL_PREFIX,
+"executor label")
+  require(
+!executorLabels.contains(SPARK_APP_ID_LABEL),
+s"Custom executor labels cannot contain $SPARK_APP_ID_LABEL as it is 
reserved for Spark.")
+  require(
+!executorLabels.contains(SPARK_EXECUTOR_ID_LABEL),
+s"Custom executor labels cannot contain $SPARK_EXECUTOR_ID_LABEL as it 
is reserved for" +
+  s" Spark.")
+
+  private val executorAnnotations =
+ConfigurationUtils.parsePrefixedKeyValuePairs(
+  sparkConf,
+  KUBERNETES_EXECUTOR_ANNOTATION_PREFIX,
+  "executor annotation")
+  private val nodeSelector =
+ConfigurationUtils.parsePrefixedKeyValuePairs(
+  sparkConf,
+  KUBERNETES_NODE_SELECTOR_PREFIX,
+  "node selector")
+
+  private val executorDockerImage = sparkConf.get(EXECUTOR_DOCKER_IMAGE)
+  private val dockerImagePullPolicy = 
sparkConf.get(DOCKER_IMAGE_PULL_POLICY)
+  private val executorPort = sparkConf.getInt("spark.executor.port", 
DEFAULT_STATIC_PORT)
+  private val blockmanagerPort = sparkConf
+.getInt("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT)
+  private val kubernetesDriverPodName = sparkConf
+.get(KUBERNETES_DRIVER_POD_NAME)
+.getOrElse(throw new SparkException("Must specify the driver pod 
name"))
+
+  private val executorPodNamePrefix = 
sparkConf.get(KUBERNETES_EXECUTOR_POD_NAME_PREFIX)
+
+  private val executorMemoryMiB = 
sparkConf.get(org.apache.spark.internal.config.EXECUTOR_MEMORY)
+  private val executorMemoryString = sparkConf.get(
--- End diff --

More context here: https://github.com/apache-spark-on-k8s/spark/pull/470


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...

2017-11-03 Thread mccheah

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19468#discussion_r148916475
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala
 ---
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.scheduler.cluster.k8s
+
+import scala.collection.JavaConverters._
+
+import io.fabric8.kubernetes.api.model._
+
+import org.apache.spark.{SparkConf, SparkException}
+import org.apache.spark.deploy.k8s.ConfigurationUtils
+import org.apache.spark.deploy.k8s.config._
+import org.apache.spark.deploy.k8s.constants._
+import org.apache.spark.util.Utils
+
+/**
+ * Configures executor pods. Construct one of these with a SparkConf to 
set up properties that are
+ * common across all executors. Then, pass in dynamic parameters into 
createExecutorPod.
+ */
+private[spark] trait ExecutorPodFactory {
+  def createExecutorPod(
+  executorId: String,
+  applicationId: String,
+  driverUrl: String,
+  executorEnvs: Seq[(String, String)],
+  driverPod: Pod,
+  nodeToLocalTaskCount: Map[String, Int]): Pod
+}
+
+private[spark] class ExecutorPodFactoryImpl(sparkConf: SparkConf)
+  extends ExecutorPodFactory {
+
+  import ExecutorPodFactoryImpl._
+
+  private val executorExtraClasspath = sparkConf.get(
+org.apache.spark.internal.config.EXECUTOR_CLASS_PATH)
+
+  private val executorLabels = 
ConfigurationUtils.parsePrefixedKeyValuePairs(
+sparkConf,
+KUBERNETES_EXECUTOR_LABEL_PREFIX,
+"executor label")
+  require(
--- End diff --

Done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...

2017-11-03 Thread mccheah

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19468#discussion_r148916344
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala
 ---
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.scheduler.cluster.k8s
+
+import scala.collection.JavaConverters._
+
+import io.fabric8.kubernetes.api.model._
+
+import org.apache.spark.{SparkConf, SparkException}
+import org.apache.spark.deploy.k8s.ConfigurationUtils
+import org.apache.spark.deploy.k8s.config._
+import org.apache.spark.deploy.k8s.constants._
+import org.apache.spark.util.Utils
+
+/**
+ * Configures executor pods. Construct one of these with a SparkConf to 
set up properties that are
+ * common across all executors. Then, pass in dynamic parameters into 
createExecutorPod.
+ */
+private[spark] trait ExecutorPodFactory {
+  def createExecutorPod(
+  executorId: String,
+  applicationId: String,
+  driverUrl: String,
+  executorEnvs: Seq[(String, String)],
+  driverPod: Pod,
+  nodeToLocalTaskCount: Map[String, Int]): Pod
+}
+
+private[spark] class ExecutorPodFactoryImpl(sparkConf: SparkConf)
+  extends ExecutorPodFactory {
+
+  import ExecutorPodFactoryImpl._
+
+  private val executorExtraClasspath = sparkConf.get(
+org.apache.spark.internal.config.EXECUTOR_CLASS_PATH)
+
+  private val executorLabels = 
ConfigurationUtils.parsePrefixedKeyValuePairs(
+sparkConf,
+KUBERNETES_EXECUTOR_LABEL_PREFIX,
+"executor label")
+  require(
+!executorLabels.contains(SPARK_APP_ID_LABEL),
+s"Custom executor labels cannot contain $SPARK_APP_ID_LABEL as it is 
reserved for Spark.")
+  require(
+!executorLabels.contains(SPARK_EXECUTOR_ID_LABEL),
+s"Custom executor labels cannot contain $SPARK_EXECUTOR_ID_LABEL as it 
is reserved for" +
+  s" Spark.")
+
+  private val executorAnnotations =
+ConfigurationUtils.parsePrefixedKeyValuePairs(
+  sparkConf,
+  KUBERNETES_EXECUTOR_ANNOTATION_PREFIX,
+  "executor annotation")
+  private val nodeSelector =
+ConfigurationUtils.parsePrefixedKeyValuePairs(
+  sparkConf,
+  KUBERNETES_NODE_SELECTOR_PREFIX,
+  "node selector")
+
+  private val executorDockerImage = sparkConf.get(EXECUTOR_DOCKER_IMAGE)
+  private val dockerImagePullPolicy = 
sparkConf.get(DOCKER_IMAGE_PULL_POLICY)
+  private val executorPort = sparkConf.getInt("spark.executor.port", 
DEFAULT_STATIC_PORT)
+  private val blockmanagerPort = sparkConf
+.getInt("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT)
+  private val kubernetesDriverPodName = sparkConf
+.get(KUBERNETES_DRIVER_POD_NAME)
+.getOrElse(throw new SparkException("Must specify the driver pod 
name"))
+
+  private val executorPodNamePrefix = 
sparkConf.get(KUBERNETES_EXECUTOR_POD_NAME_PREFIX)
+
+  private val executorMemoryMiB = 
sparkConf.get(org.apache.spark.internal.config.EXECUTOR_MEMORY)
+  private val executorMemoryString = sparkConf.get(
--- End diff --

Kubernetes will treat MiB differently from MB I believe. The JVM's 
formatting of memory strings is different from what Kubernetes expects in its 
API.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...

2017-11-03 Thread mccheah

Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19468#discussion_r148916191
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala
 ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s
+
+import java.io.File
+
+import com.google.common.base.Charsets
+import com.google.common.io.Files
+import io.fabric8.kubernetes.client.{Config, ConfigBuilder, 
DefaultKubernetesClient, KubernetesClient}
--- End diff --

@jiangxb1987 Is this the convention across the codebase? I don't think most 
IDEs will preserve that formatting.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19655: [SPARK-22444][core] Spark History Server missing /enviro...

2017-11-03 Thread vanzin

Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19655
  
@ambud please close this, we don't add features to old releases (only bugs 
fixes).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19655: [SPARK-22444][core] Spark History Server missing /enviro...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19655
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19655: [SPARK-22444][core] Spark History Server missing ...

2017-11-03 Thread ambud

GitHub user ambud opened a pull request:

https://github.com/apache/spark/pull/19655

[SPARK-22444][core] Spark History Server missing /environment endpoint/api

## What changes were proposed in this pull request?
Spark History Server REST API is missing the /environment endpoint. This 
endpoint is otherwise available in Spark 2.x however it's missing from 1.6.x.
Since the environment endpoint provides programmatic access to critical 
launch parameters it would be great to have this back ported.
This feature was contributed under: 
https://issues.apache.org/jira/browse/SPARK-16122


## How was this patch tested?

Will update this shortly!

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ambud/spark branch-1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19655.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19655


commit a0d549dd566d4683bd44fd162e776fa340f87428
Author: ambud 
Date:   2017-11-03T23:40:26Z

Adding environment api SPARK-22444




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19582
  
**[Test build #83424 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83424/testReport)**
 for PR 19582 at commit 
[`537c7b4`](https://github.com/apache/spark/commit/537c7b4bf27e1392fd4868db688f7d2f48baa3a5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...

2017-11-03 Thread vanzin

Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19582
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19582: [SPARK-20644][core] Initial ground work for kvsto...

2017-11-03 Thread vanzin

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19582#discussion_r148913864
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -486,8 +580,21 @@ private[history] class FsHistoryProvider(conf: 
SparkConf, clock: Clock)
 bus.addListener(listener)
 
 replay(fileStatus, isApplicationCompleted(fileStatus), bus, 
eventsFilter)
-listener.applicationInfo.foreach(addListing)
-listing.write(LogInfo(logPath.toString(), fileStatus.getLen()))
+listener.applicationInfo.foreach { app =>
+  // Invalidate the existing UI for the reloaded app attempt, if any. 
Note that this does
+  // not remove the UI from the active list; that has to be done in 
onUIDetached, so that
+  // cleanup of files can be done in a thread-safe manner. It does 
mean the UI will remain
+  // in memory for longer than it should.
--- End diff --

Done. Also updated a bunch of other stale comments.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r148909027
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+A singleton-like attribute of :class:`_ImageSchema` in this module.
--- End diff --

Thanks for the thoughts!  Keeping it as is sounds good to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83423/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r148908923
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", 
"nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, 
Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: 
Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * :: Experimental ::
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * :: Experimental ::
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * :: Experimental ::
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * Default values for the invalid image
+   *
+   * @param origin Origin of the invalid image
+   * @return Row with the default values
+   */
+  private def invalidImageRow(origin: String): Row =
+Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), 
Array.ofDim[Byte](0)))
+
+  /**
+   * Convert the compressed image (jpeg, png, etc.) into OpenCV
+   * representation and store it in DataFrame Row
+   *
+   * @param origin Arbitrary string that identifies the image
+   * @param bytes Image bytes (for example, jpeg)
+   * @return DataFrame Row or None (if the decompression fails)
+   */
+  private[spark] def decode(origin: String, bytes: Array[Byte]): 
Option[Row] = {
+
+

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19651
  
Retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-11-03 Thread jkbradley

Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19208
  
Done with review.  I mainly review CrossValidator since some comments will 
apply to TrainValidationSplit as well.  Thanks for the PR!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148885057
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -271,6 +303,20 @@ class CrossValidatorModel private[ml] (
 @Since("1.6.0")
 object CrossValidatorModel extends MLReadable[CrossValidatorModel] {
 
+  private[CrossValidatorModel] def copySubModels(subModels: 
Option[Array[Array[Model[_) = {
--- End diff --

style: state return value explicitly


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148885486
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -271,6 +303,20 @@ class CrossValidatorModel private[ml] (
 @Since("1.6.0")
 object CrossValidatorModel extends MLReadable[CrossValidatorModel] {
 
+  private[CrossValidatorModel] def copySubModels(subModels: 
Option[Array[Array[Model[_) = {
+subModels.map { subModels =>
--- End diff --

Can this be simplified using map?
```
subModels.map(_.map(_.map(_.copy(...).asInstanceOf[...])))
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148886008
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -282,12 +328,40 @@ object CrossValidatorModel extends 
MLReadable[CrossValidatorModel] {
 
 ValidatorParams.validateParams(instance)
 
+protected var shouldPersistSubModels: Boolean = if 
(instance.hasSubModels) true else false
+
+/**
+ * Extra options for CrossValidatorModelWriter, current support 
"persistSubModels".
+ * if sub models exsit, the default value for option 
"persistSubModels" is "true".
+ */
+@Since("2.3.0")
+override def option(key: String, value: String): this.type = {
+  key.toLowerCase(Locale.ROOT) match {
+case "persistsubmodels" => shouldPersistSubModels = value.toBoolean
+case _ => throw new IllegalArgumentException(
+  s"Illegal option ${key} for CrossValidatorModelWriter")
+  }
+  this
+}
+
 override protected def saveImpl(path: String): Unit = {
   import org.json4s.JsonDSL._
-  val extraMetadata = "avgMetrics" -> instance.avgMetrics.toSeq
+  val extraMetadata = ("avgMetrics" -> instance.avgMetrics.toSeq) ~
+("shouldPersistSubModels" -> shouldPersistSubModels)
   ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata))
   val bestModelPath = new Path(path, "bestModel").toString
   instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath)
+  if (shouldPersistSubModels) {
+require(instance.hasSubModels, "Cannot get sub models to persist.")
--- End diff --

This error message may be unclear.  How about adding: "When persisting 
tuning models, you can only set persistSubModels to true if the tuning was done 
with collectSubModels set to true.  To save the sub-models, try rerunning 
fitting with collectSubModels set to true."


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148886190
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -282,12 +328,40 @@ object CrossValidatorModel extends 
MLReadable[CrossValidatorModel] {
 
 ValidatorParams.validateParams(instance)
 
+protected var shouldPersistSubModels: Boolean = if 
(instance.hasSubModels) true else false
+
+/**
+ * Extra options for CrossValidatorModelWriter, current support 
"persistSubModels".
+ * if sub models exsit, the default value for option 
"persistSubModels" is "true".
+ */
+@Since("2.3.0")
+override def option(key: String, value: String): this.type = {
+  key.toLowerCase(Locale.ROOT) match {
+case "persistsubmodels" => shouldPersistSubModels = value.toBoolean
+case _ => throw new IllegalArgumentException(
+  s"Illegal option ${key} for CrossValidatorModelWriter")
+  }
+  this
+}
+
 override protected def saveImpl(path: String): Unit = {
   import org.json4s.JsonDSL._
-  val extraMetadata = "avgMetrics" -> instance.avgMetrics.toSeq
+  val extraMetadata = ("avgMetrics" -> instance.avgMetrics.toSeq) ~
+("shouldPersistSubModels" -> shouldPersistSubModels)
   ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata))
   val bestModelPath = new Path(path, "bestModel").toString
   instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath)
+  if (shouldPersistSubModels) {
+require(instance.hasSubModels, "Cannot get sub models to persist.")
+val subModelsPath = new Path(path, "subModels")
+for (splitIndex <- 0 until instance.getNumFolds) {
+  val splitPath = new Path(subModelsPath, splitIndex.toString)
--- End diff --

How about naming this with the string "fold":
```splitIndex.toString``` --> ```"fold" + splitIndex.toString```?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148885817
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -282,12 +328,40 @@ object CrossValidatorModel extends 
MLReadable[CrossValidatorModel] {
 
 ValidatorParams.validateParams(instance)
 
+protected var shouldPersistSubModels: Boolean = if 
(instance.hasSubModels) true else false
+
+/**
+ * Extra options for CrossValidatorModelWriter, current support 
"persistSubModels".
+ * if sub models exsit, the default value for option 
"persistSubModels" is "true".
+ */
+@Since("2.3.0")
+override def option(key: String, value: String): this.type = {
+  key.toLowerCase(Locale.ROOT) match {
+case "persistsubmodels" => shouldPersistSubModels = value.toBoolean
+case _ => throw new IllegalArgumentException(
+  s"Illegal option ${key} for CrossValidatorModelWriter")
+  }
+  this
+}
+
 override protected def saveImpl(path: String): Unit = {
   import org.json4s.JsonDSL._
-  val extraMetadata = "avgMetrics" -> instance.avgMetrics.toSeq
+  val extraMetadata = ("avgMetrics" -> instance.avgMetrics.toSeq) ~
+("shouldPersistSubModels" -> shouldPersistSubModels)
--- End diff --

Let's have 1 name for this argument: "persistSubModels"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148908525
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -187,6 +191,50 @@ class CrossValidatorSuite
   .compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("CrossValidator expose sub models") {
+val lr = new LogisticRegression
+val lrParamMaps = new ParamGridBuilder()
+  .addGrid(lr.regParam, Array(0.001, 1000.0))
+  .addGrid(lr.maxIter, Array(0, 3))
+  .build()
+val eval = new BinaryClassificationEvaluator
+val numFolds = 3
+val subPath = new File(tempDir, "testCrossValidatorSubModels")
+val persistSubModelsPath = new File(subPath, "subModels").toString
+
+val cv = new CrossValidator()
+  .setEstimator(lr)
+  .setEstimatorParamMaps(lrParamMaps)
+  .setEvaluator(eval)
+  .setNumFolds(numFolds)
+  .setParallelism(1)
+  .setCollectSubModels(true)
+
+val cvModel = cv.fit(dataset)
+
+assert(cvModel.hasSubModels && cvModel.subModels.length == numFolds)
+cvModel.subModels.foreach(array => assert(array.length == 
lrParamMaps.length))
+
+// Test the default value for option "persistSubModel" to be "true"
+val savingPathWithSubModels = new File(subPath, "cvModel3").getPath
+cvModel.save(savingPathWithSubModels)
+val cvModel3 = CrossValidatorModel.load(savingPathWithSubModels)
+assert(cvModel3.hasSubModels && cvModel3.subModels.length == numFolds)
+cvModel3.subModels.foreach(array => assert(array.length == 
lrParamMaps.length))
+
+val savingPathWithoutSubModels = new File(subPath, "cvModel2").getPath
+cvModel.write.option("persistSubModels", 
"false").save(savingPathWithoutSubModels)
+val cvModel2 = CrossValidatorModel.load(savingPathWithoutSubModels)
--- End diff --

Shall we try saving cvModel2 with persistSubModels = true and check for an 
Exception?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83422/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-03 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19651
  
Retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread mgaido91

Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19653
  
I checked with Postgres and we have the same results of Postgres too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19653
  
**[Test build #83421 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83421/testReport)**
 for PR 19653 at commit 
[`92308a4`](https://github.com/apache/spark/commit/92308a4341849258caf549d1bcbeabd9002d3ead).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19653
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19653
  
Our results are exactly the same as Oracle. 

LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL ...

2017-11-03 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19653#discussion_r148903491
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/null-handling.sql.out ---
@@ -0,0 +1,305 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 28
+
+
+-- !query 0
+create table t1(a int, b int, c int) using parquet
+-- !query 0 schema
+struct<>
+-- !query 0 output
+
+
+
+-- !query 1
+insert into t1 values(1,0,0)
+-- !query 1 schema
+struct<>
+-- !query 1 output
+
+
+
+-- !query 2
+insert into t1 values(2,0,1)
+-- !query 2 schema
+struct<>
+-- !query 2 output
+
+
+
+-- !query 3
+insert into t1 values(3,1,0)
+-- !query 3 schema
+struct<>
+-- !query 3 output
+
+
+
+-- !query 4
+insert into t1 values(4,1,1)
+-- !query 4 schema
+struct<>
+-- !query 4 output
+
+
+
+-- !query 5
+insert into t1 values(5,null,0)
+-- !query 5 schema
+struct<>
+-- !query 5 output
+
+
+
+-- !query 6
+insert into t1 values(6,null,1)
+-- !query 6 schema
+struct<>
+-- !query 6 output
+
+
+
+-- !query 7
+insert into t1 values(7,null,null)
+-- !query 7 schema
+struct<>
+-- !query 7 output
+
+
+
+-- !query 8
+select a, b+c from t1
+-- !query 8 schema
+struct
+-- !query 8 output
+1  0
+2  1
+3  1
+4  2
+5  NULL
+6  NULL
+7  NULL
+
+
+-- !query 9
+select a+10, b*0 from t1
+-- !query 9 schema
+struct<(a + 10):int,(b * 0):int>
+-- !query 9 output
+11 0
+12 0
+13 0
+14 0
+15 NULL
+16 NULL
+17 NULL
+
+
+-- !query 10
+select distinct b from t1
+-- !query 10 schema
+struct
+-- !query 10 output
+0
+1
+NULL
+
+
+-- !query 11
+select b from t1 union select b from t1
+-- !query 11 schema
+struct
+-- !query 11 output
+0
+1
+NULL
+
+
+-- !query 12
+select a+20, case b when c then 1 else 0 end from t1
+-- !query 12 schema
+struct<(a + 20):int,CASE WHEN (b = c) THEN 1 ELSE 0 END:int>
+-- !query 12 output
+21 1
+22 0
+23 0
+24 1
+25 0
+26 0
+27 0
+
+
+-- !query 13
+select a+30, case c when b then 1 else 0 end from t1
+-- !query 13 schema
+struct<(a + 30):int,CASE WHEN (c = b) THEN 1 ELSE 0 END:int>
+-- !query 13 output
+31 1
+32 0
+33 0
+34 1
+35 0
+36 0
+37 0
+
+
+-- !query 14
+select a+40, case when b<>0 then 1 else 0 end from t1
+-- !query 14 schema
+struct<(a + 40):int,CASE WHEN (NOT (b = 0)) THEN 1 ELSE 0 END:int>
+-- !query 14 output
+41 0
+42 0
+43 1
+44 1
+45 0
+46 0
+47 0
+
+
+-- !query 15
+select a+50, case when not b<>0 then 1 else 0 end from t1
+-- !query 15 schema
+struct<(a + 50):int,CASE WHEN (NOT (NOT (b = 0))) THEN 1 ELSE 0 END:int>
+-- !query 15 output
+51 1
+52 1
+53 0
+54 0
+55 0
+56 0
+57 0
+
+
+-- !query 16
+select a+60, case when b<>0 and c<>0 then 1 else 0 end from t1
+-- !query 16 schema
+struct<(a + 60):int,CASE WHEN ((NOT (b = 0)) AND (NOT (c = 0))) THEN 1 
ELSE 0 END:int>
+-- !query 16 output
+61 0
+62 0
+63 0
+64 1
+65 0
+66 0
+67 0
+
+
+-- !query 17
+select a+70, case when not (b<>0 and c<>0) then 1 else 0 end from t1
+-- !query 17 schema
+struct<(a + 70):int,CASE WHEN (NOT ((NOT (b = 0)) AND (NOT (c = 0 THEN 
1 ELSE 0 END:int>
+-- !query 17 output
+71 1
+72 1
+73 1
+74 0
+75 1
+76 0
+77 0
+
+
+-- !query 18
+select a+80, case when b<>0 or c<>0 then 1 else 0 end from t1
+-- !query 18 schema
+struct<(a + 80):int,CASE WHEN ((NOT (b = 0)) OR (NOT (c = 0))) THEN 1 ELSE 
0 END:int>
+-- !query 18 output
+81 0
+82 1
+83 1
+84 1
+85 0
+86 1
+87 0
+
+
+-- !query 19
+select a+90, case when not (b<>0 or c<>0) then 1 else 0 end from t1
+-- !query 19 schema
+struct<(a + 90):int,CASE WHEN (NOT ((NOT (b = 0)) OR (NOT (c = 0 THEN 
1 ELSE 0 END:int>
+-- !query 19 output
+91 1
+92 0
+93 0
+94 0
+95 0
+96 0
+97 0
+
+
+-- !query 20
+select count(*), count(b), sum(b), avg(b), min(b), max(b) from t1
+-- !query 20 schema

+struct
+-- !query 20 output
+7  4   2   0.5 0   1
+
+
+-- !query 21
+select a+100 from t1 where b<10
+-- !query 21 schema
+struct<(a + 100):int>
+-- !query 21 output
+101
+102
+103
+104
+
+
+-- !query 22
+select a+110 from t1 where not b>10
+-- !query 22 sc

[GitHub] spark pull request #19582: [SPARK-20644][core] Initial ground work for kvsto...

2017-11-03 Thread squito

Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19582#discussion_r148901343
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -486,8 +580,21 @@ private[history] class FsHistoryProvider(conf: 
SparkConf, clock: Clock)
 bus.addListener(listener)
 
 replay(fileStatus, isApplicationCompleted(fileStatus), bus, 
eventsFilter)
-listener.applicationInfo.foreach(addListing)
-listing.write(LogInfo(logPath.toString(), fileStatus.getLen()))
+listener.applicationInfo.foreach { app =>
+  // Invalidate the existing UI for the reloaded app attempt, if any. 
Note that this does
+  // not remove the UI from the active list; that has to be done in 
onUIDetached, so that
+  // cleanup of files can be done in a thread-safe manner. It does 
mean the UI will remain
+  // in memory for longer than it should.
--- End diff --

it took me some time to figure how the cache & invalidation worked, mostly 
because I wasn't looking in the right places.  I don't think you've made this 
any more confusing than it was before (in fact its probably better), but seems 
like a good opportunity to improve commenting a little.  I think it might help 
to have one comment in the code where the entire sequence is described ( here 
on `mergeApplicationListing`, or on `AppCache`, or 
`ApplicationCacheCheckFilter`, doesn't really matter, but they could all 
reference the longer comment).  if I understand correctly, it would be 
something like:

Logs of incomplete apps are regularly polled to see if they have been 
updated (based on an increase in file size).  If they have, the existing data 
for that app is marked as invalid in LoadedAppUI.  However, no memory is freed, 
no files are cleaned up at this time, nor is a new UI built.  On each request 
for one app's UI, the application cache is checked  to see if it has a valid 
`LoadedAppUI` in the cache.  If there is data in the cache and its valid, then 
its served.  If there is data in the cache but it is invalid, then the UI is 
rebuilt from the raw event logs.  If there is nothing in the cache, then the UI 
is built from the raw event logs and added to the cache.  This may kick another 
entry out of the cache -- if its for an incomplete app, then any KVStore data 
written to disk is deleted (as the KVStore for an incomplete app is always 
regenerated from scratch anyway). 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19582: [SPARK-20644][core] Initial ground work for kvsto...

2017-11-03 Thread squito

Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19582#discussion_r148879455
  
--- Diff: core/src/main/scala/org/apache/spark/status/storeTypes.scala ---
@@ -17,12 +17,17 @@
 
 package org.apache.spark.status
 
+import java.lang.{Integer => JInteger, Long => JLong}
+
 import com.fasterxml.jackson.annotation.JsonIgnore
 
 import org.apache.spark.status.KVUtils._
 import org.apache.spark.status.api.v1._
 import org.apache.spark.util.kvstore.KVIndex
 
+private[spark] case class AppStatusStoreMetadata(
+val version: Long)
--- End diff --

nit: `val` is unnecessary


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19583: [WIP][SPARK-22339] [CORE] [NETWORK-SHUFFLE] Push epoch u...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19583
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19582
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19583: [WIP][SPARK-22339] [CORE] [NETWORK-SHUFFLE] Push epoch u...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19583
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83419/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19582
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83420/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19653
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...

2017-11-03 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19653
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83418/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

2017-11-03 Thread steveloughran

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r14237
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -50,28 +53,34 @@
 
   /**
* Creates a writer factory which will be serialized and sent to 
executors.
+   *
+   * If this method fails (by throwing an exception), the action would 
fail and no Spark job was
+   * submitted.
*/
   DataWriterFactory createWriterFactory();
 
   /**
* Commits this writing job with a list of commit messages. The commit 
messages are collected from
-   * successful data writers and are produced by {@link 
DataWriter#commit()}. If this method
-   * fails(throw exception), this writing job is considered to be failed, 
and
-   * {@link #abort(WriterCommitMessage[])} will be called. The written 
data should only be visible
-   * to data source readers if this method succeeds.
+   * successful data writers and are produced by {@link 
DataWriter#commit()}.
+   *
+   * If this method fails (by throwing an exception), this writing job is 
considered to to have been
+   * failed, and {@link #abort(WriterCommitMessage[])} would be called. 
The state of the destination
+   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be 
able to deal with it.
*
* Note that, one partition may have multiple committed data writers 
because of speculative tasks.
* Spark will pick the first successful one and get its commit message. 
Implementations should be
--- End diff --

thinking about this some more (and still reserving the right to be provably 
wrong), I could imagine a job commit protocol which allowed 1+ Task attempt to 
commit, making the decision about which completed tasks to accept into the 
final results at job commit.

That is, 

1. every task attempt could (atomically) promote its output to the job 
commit dir
1. the job commit would enumerate all promoted task attempts, and choose 
which ones
to accept.
1. The ones it didn't want would be aborted by the job committer.

Example: task attempts rename their output to a dir 
`dest/_tempt/jobId/completed/$taskId_$task_attemptId`; job committer would enum 
all directories in the completed dir, and for all dirs where the task ID was 
the same: pick one to commit, abort the others.

With that variant, you don't need to coordinate task commit across workers, 
even with speculation enabled. You'd just need to be confident that the 
promotion of task commit information was atomic, the view of the output 
consistent, and that job commit is not initiated until at least one attempt per 
task has succeeded. Job commit is potentially slower though.
Because you can turn off use of the output co-ordinator when writing to 
hive/hadoop tables, then a commit protocol like this could work today. 
Interestingly, it's not that far off the FileOutputCommitter v1 protocol, you'd 
just renane the task attempt output dir to the promoted dir & let the job 
commit filter out duplicates from its directory listing. Be a bit more 
expensive in terms of storage use between task commit and job commit though.

Maybe a good policy here is "a job must not commit 1+ task attempt", but 
give committers the option to bypass the output coordinator. if the job 
committer can handle the conflict resolution & so make that guarantee in its 
commit phase.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

2017-11-03 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19638#discussion_r148881376
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala 
---
@@ -764,13 +764,17 @@ class LinearRegressionSuite
   (Intercept) 6.3022157  0.00186003388   <2e-16 ***
   V2  4.6982442  0.00118053980   <2e-16 ***
   V3  7.1994344  0.00090447961   <2e-16 ***
+
+  # R code for r2adj
--- End diff --

What is `X1`? it's never defined.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

2017-11-03 Thread rdblue

Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r148875607
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -50,28 +53,34 @@
 
   /**
* Creates a writer factory which will be serialized and sent to 
executors.
+   *
+   * If this method fails (by throwing an exception), the action would 
fail and no Spark job was
+   * submitted.
*/
   DataWriterFactory createWriterFactory();
 
   /**
* Commits this writing job with a list of commit messages. The commit 
messages are collected from
-   * successful data writers and are produced by {@link 
DataWriter#commit()}. If this method
-   * fails(throw exception), this writing job is considered to be failed, 
and
-   * {@link #abort(WriterCommitMessage[])} will be called. The written 
data should only be visible
-   * to data source readers if this method succeeds.
+   * successful data writers and are produced by {@link 
DataWriter#commit()}.
+   *
+   * If this method fails (by throwing an exception), this writing job is 
considered to to have been
+   * failed, and {@link #abort(WriterCommitMessage[])} would be called. 
The state of the destination
+   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be 
able to deal with it.
*
* Note that, one partition may have multiple committed data writers 
because of speculative tasks.
* Spark will pick the first successful one and get its commit message. 
Implementations should be
--- End diff --

I haven't read Steve's points here entirely, but I agree that Spark should 
be primarily responsible for task commit coordination. Most implementations 
would be fine using the current [output commit 
coordinator](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala),
 which does a good job balancing the trade-offs that you've been discussing. It 
ensures that only one task is authorized to commit and has well-defined failure 
cases (when a network partition prevents the authorized committer from 
responding before its commit authorization times out).

I think that Spark should use the current commit coordinator unless an 
implementation opts out of using it (and I'm not sure that opting out is a use 
case we care to support at this point). It's fine if Spark documents how its 
coordinator works and there are some drawbacks, but expecting implementations 
to handle their own commit coordination (which requires RPC for Spark) is, I 
think, unreasonable. Let's use the one we have by default, however imperfect.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19634: [SPARK-22412][SQL] Fix incorrect comment in DataS...

2017-11-03 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19634#discussion_r148875065
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -469,7 +469,7 @@ case class FileSourceScanExec(
   currentSize = 0
 }
 
-// Assign files to partitions using "First Fit Decreasing" (FFD)
+// Assign files to partitions using "Next Fit Decreasing"
--- End diff --

@liancheng do you agree?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148859381
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -252,19 +252,29 @@ object CrossValidator extends 
MLReadable[CrossValidator] {
 class CrossValidatorModel private[ml] (
 @Since("1.4.0") override val uid: String,
 @Since("1.2.0") val bestModel: Model[_],
-@Since("1.5.0") val avgMetrics: Array[Double],
-@Since("2.3.0") val subModels: Option[Array[Array[Model[_)
+@Since("1.5.0") val avgMetrics: Array[Double])
   extends Model[CrossValidatorModel] with CrossValidatorParams with 
MLWritable {
 
   /** A Python-friendly auxiliary constructor. */
   private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: 
JList[Double]) = {
-this(uid, bestModel, avgMetrics.asScala.toArray, null)
+this(uid, bestModel, avgMetrics.asScala.toArray)
   }
 
-  private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: 
Array[Double]) = {
-this(uid, bestModel, avgMetrics, null)
+  private var _subModels: Option[Array[Array[Model[_ = None
+
+  @Since("2.3.0")
+  private[tuning] def setSubModels(subModels: 
Option[Array[Array[Model[_)
+: CrossValidatorModel = {
+_subModels = subModels
+this
   }
 
+  @Since("2.3.0")
+  def subModels: Array[Array[Model[_]]] = _subModels.get
--- End diff --

Also, can you please add a better Exception message?  If submodels are not 
available, then we should tell users to set the collectSubModels Param before 
fitting.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148859543
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -325,14 +328,19 @@ object CrossValidatorModel extends 
MLReadable[CrossValidatorModel] {
 
 ValidatorParams.validateParams(instance)
 
-protected var shouldPersistSubModels: Boolean = false
+protected var shouldPersistSubModels: Boolean = if 
(instance.hasSubModels) true else false
 
 /**
- * Set option for persist sub models.
+ * Extra options for CrossValidatorModelWriter, current support 
"persistSubModels".
+ * if sub models exsit, the default value for option 
"persistSubModels" is "true".
--- End diff --

typo: exsit -> exist


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148859043
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -252,19 +252,29 @@ object CrossValidator extends 
MLReadable[CrossValidator] {
 class CrossValidatorModel private[ml] (
 @Since("1.4.0") override val uid: String,
 @Since("1.2.0") val bestModel: Model[_],
-@Since("1.5.0") val avgMetrics: Array[Double],
-@Since("2.3.0") val subModels: Option[Array[Array[Model[_)
+@Since("1.5.0") val avgMetrics: Array[Double])
   extends Model[CrossValidatorModel] with CrossValidatorParams with 
MLWritable {
 
   /** A Python-friendly auxiliary constructor. */
   private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: 
JList[Double]) = {
-this(uid, bestModel, avgMetrics.asScala.toArray, null)
+this(uid, bestModel, avgMetrics.asScala.toArray)
   }
 
-  private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: 
Array[Double]) = {
-this(uid, bestModel, avgMetrics, null)
+  private var _subModels: Option[Array[Array[Model[_ = None
+
+  @Since("2.3.0")
--- End diff --

Only use Since annotations for public APIs


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148859168
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -252,19 +252,29 @@ object CrossValidator extends 
MLReadable[CrossValidator] {
 class CrossValidatorModel private[ml] (
 @Since("1.4.0") override val uid: String,
 @Since("1.2.0") val bestModel: Model[_],
-@Since("1.5.0") val avgMetrics: Array[Double],
-@Since("2.3.0") val subModels: Option[Array[Array[Model[_)
+@Since("1.5.0") val avgMetrics: Array[Double])
   extends Model[CrossValidatorModel] with CrossValidatorParams with 
MLWritable {
 
   /** A Python-friendly auxiliary constructor. */
   private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: 
JList[Double]) = {
-this(uid, bestModel, avgMetrics.asScala.toArray, null)
+this(uid, bestModel, avgMetrics.asScala.toArray)
   }
 
-  private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: 
Array[Double]) = {
-this(uid, bestModel, avgMetrics, null)
+  private var _subModels: Option[Array[Array[Model[_ = None
+
+  @Since("2.3.0")
+  private[tuning] def setSubModels(subModels: 
Option[Array[Array[Model[_)
+: CrossValidatorModel = {
+_subModels = subModels
+this
   }
 
+  @Since("2.3.0")
+  def subModels: Array[Array[Model[_]]] = _subModels.get
--- End diff --

Let's add Scala doc.  We'll need to explain what the inner and outer array 
are and which one corresponds to the ordering of estimatorParamsMaps.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148860542
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -108,6 +108,13 @@ abstract class MLWriter extends BaseReadWrite with 
Logging {
   protected def saveImpl(path: String): Unit
 
   /**
+   * `option()` handles extra options. If subclasses need to support extra 
options, override this
+   * method.
+   */
+  @Since("2.3.0")
+  def option(key: String, value: String): this.type = this
--- End diff --

Rather than overriding this in each subclass, let's have this option() 
method collect the specified options in a map which is consumed by the subclass 
when saveImpl() is called.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

2017-11-03 Thread tengpeng

Github user tengpeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19638#discussion_r148869278
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala 
---
@@ -764,13 +764,17 @@ class LinearRegressionSuite
   (Intercept) 6.3022157  0.00186003388   <2e-16 ***
   V2  4.6982442  0.00118053980   <2e-16 ***
   V3  7.1994344  0.00090447961   <2e-16 ***
+
+  # R code for r2adj
--- End diff --

I just want to confirm: is `summary(lm(X1 ~ ., data = part_0))` this 
does not work on your side? Is it because you used different variable name or 
data name?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread mallman

Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya Can you please take a look at my latest revisions and replies to 
your comments? Cheers.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread mallman

Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r148866084
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
 ---
@@ -63,9 +74,22 @@ private[parquet] class ParquetReadSupport extends 
ReadSupport[UnsafeRow] with Lo
   StructType.fromString(schemaString)
 }
 
-val parquetRequestedSchema =
+val clippedParquetSchema =
   ParquetReadSupport.clipParquetSchema(context.getFileSchema, 
catalystRequestedSchema)
 
+val parquetRequestedSchema = if (parquetMrCompatibility) {
+  // Parquet-mr will throw an exception if we try to read a superset 
of the file's schema.
+  // Therefore, we intersect our clipped schema with the underlying 
file's schema
--- End diff --

On the other hand, if we fix `parquetMrCompatibility` to `true`, then a 
couple of other tests fail. Namely, these tests are

"Filter applied on merged Parquet schema with new column should work" in 
`ParquetFilterSuite.scala`
"read partitioned table - merging compatible schemas" in 
`ParquetPartitionDiscoverySuite.scala`

In both cases, the failures involve queries over multiple files with 
compatible but different schema.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL ...

2017-11-03 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19653#discussion_r148865915
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/NullHandlingSuite.scala ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql
+
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSQLContext
+
+case class T1(a: Int, b: Option[Int], c: Option[Int])
+
+/**
+ * This test suite takes https://sqlite.org/nulls.html as a reference.
+ */
+class NullHandlingSuite extends QueryTest with SharedSQLContext {
--- End diff --

Maybe using SQL? It will be easier for the others who knew SQL only to 
understand our NULL handling logics. Thanks again!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread mallman

Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r148863673
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
 ---
@@ -63,9 +74,22 @@ private[parquet] class ParquetReadSupport extends 
ReadSupport[UnsafeRow] with Lo
   StructType.fromString(schemaString)
 }
 
-val parquetRequestedSchema =
+val clippedParquetSchema =
   ParquetReadSupport.clipParquetSchema(context.getFileSchema, 
catalystRequestedSchema)
 
+val parquetRequestedSchema = if (parquetMrCompatibility) {
+  // Parquet-mr will throw an exception if we try to read a superset 
of the file's schema.
+  // Therefore, we intersect our clipped schema with the underlying 
file's schema
--- End diff --

As for the problem of requesting a read of a superset of a file's fields, 
if we disable the `parquetMrCompatibility` code, then the "partial schema 
intersection - select missing subfield" test in 
`ParquetSchemaPruningSuite.scala` fails with the following stack trace:

```
[info]   Cause: org.apache.parquet.io.ParquetDecodingException: Can not 
read value at 0 in block -1 in file 
file:/Volumes/VideoAmpCS/msa/workspace/spark-public/target/tmp/spark-a0bda193-9d3f-4cd1-885c-9e8b5b0fc1ed/contacts/p=2/part-1-4a8671f1-afb2-482f-8c4d-4f6f4df896bc-c000.snappy.parquet
[info]   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
[info]   at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:215)
[info]   at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[info]   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[info]   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
[info]   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
[info]   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
[info]   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
[info]   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:432)
[info]   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[info]   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[info]   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1820)
[info]   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
[info]   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
[info]   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2064)
[info]   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2064)
[info]   at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
[info]   at org.apache.spark.scheduler.Task.run(Task.scala:108)
[info]   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
[info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)
[info]   Cause: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
[info]   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
[info]   at java.util.ArrayList.get(ArrayList.java:429)
[info]   at 
org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:103)
[info]   at 
org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:103)
[info]   at 
org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
[info]   at 
org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
[info]   at 
org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:278)
[info]   at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:141)
[info]   at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:107)
[info]   at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:155)
[info]   at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:107)
[info]   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
[info]   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
[info]   at 
org.apache.parquet.hadoop

[GitHub] spark pull request #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvalu...

2017-11-03 Thread yanboliang

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19648#discussion_r148861446
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala
 ---
@@ -22,15 +22,21 @@ import org.apache.spark.ml.param.ParamsSuite
 import org.apache.spark.ml.util.DefaultReadWriteTest
 import org.apache.spark.ml.util.TestingUtils._
 import org.apache.spark.mllib.util.MLlibTestSparkContext
-import org.apache.spark.sql.{DataFrame, SparkSession}
-import org.apache.spark.sql.types.IntegerType
+import org.apache.spark.sql.Dataset
 
 
 class ClusteringEvaluatorSuite
   extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
 
   import testImplicits._
 
+  @transient var irisDataset: Dataset[_] = _
--- End diff --

Either is ok, no difference on performance, we just keep consistent with 
other test suites. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-03 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r148857274
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -276,12 +315,32 @@ object TrainValidationSplitModel extends 
MLReadable[TrainValidationSplitModel] {
 
 ValidatorParams.validateParams(instance)
 
+protected var shouldPersistSubModels: Boolean = false
+
+/**
+ * Set option for persist sub models.
+ */
+@Since("2.3.0")
+def persistSubModels(persist: Boolean): this.type = {
+  shouldPersistSubModels = persist
+  this
+}
+
 override protected def saveImpl(path: String): Unit = {
   import org.json4s.JsonDSL._
-  val extraMetadata = "validationMetrics" -> 
instance.validationMetrics.toSeq
+  val extraMetadata = ("validationMetrics" -> 
instance.validationMetrics.toSeq) ~
+("shouldPersistSubModels" -> shouldPersistSubModels)
   ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata))
   val bestModelPath = new Path(path, "bestModel").toString
   instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath)
+  if (shouldPersistSubModels) {
+require(instance.subModels != null, "Cannot get sub models to 
persist.")
+val subModelsPath = new Path(path, "subModels")
+for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
+  val modelPath = new Path(subModelsPath, 
paramIndex.toString).toString
+  
instance.subModels(paramIndex).asInstanceOf[MLWritable].save(modelPath)
--- End diff --

Good question about cleaning up partially saved models.  I agree it'd be 
nice to do in the future, rather than now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...

2017-11-03 Thread jkbradley

Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19641
  
LGTM pending tests


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 >

1 - 100 of 320 matches

Mail list logo