[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21102
  
**[Test build #93740 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93740/testReport)**
 for PR 21102 at commit 
[`c398291`](https://github.com/apache/spark/commit/c3982911d6195fc1bc1c63d72d4b3273f958accd).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21102
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1455/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21102
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21488: [SPARK-18057][SS] Update Kafka client version from 0.10....

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93736/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21488: [SPARK-18057][SS] Update Kafka client version from 0.10....

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21488: [SPARK-18057][SS] Update Kafka client version from 0.10....

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21488
  
**[Test build #93736 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93736/testReport)**
 for PR 21488 at commit 
[`aa69915`](https://github.com/apache/spark/commit/aa69915165d9aaca2bcb5d22fb2fc9467bf16826).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21103: [SPARK-23915][SQL] Add array_except function

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21103
  
**[Test build #93739 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93739/testReport)**
 for PR 21103 at commit 
[`3639b5b`](https://github.com/apache/spark/commit/3639b5b1b460c19b7c3a787d118a0bcf35cb8e23).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21103: [SPARK-23915][SQL] Add array_except function

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21103
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1454/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21103: [SPARK-23915][SQL] Add array_except function

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21103
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-07-28 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r205961043
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContextImpl.scala 
---
@@ -39,8 +44,51 @@ private[spark] class BarrierTaskContextImpl(
   taskMemoryManager, localProperties, metricsSystem, taskMetrics)
 with BarrierTaskContext {
 
-  // TODO SPARK-24817 implement global barrier.
-  override def barrier(): Unit = {}
+  private val barrierCoordinator: RpcEndpointRef = {
+val env = SparkEnv.get
+RpcUtils.makeDriverRef("barrierSync", env.conf, env.rpcEnv)
+  }
+
+  private val timer = new Timer("Barrier task timer for barrier() calls.")
+
+  private var barrierEpoch = 0
+
+  private lazy val numTasks = localProperties.getProperty("numTasks", 
"0").toInt
+
+  override def barrier(): Unit = {
+logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) has entered " +
+  s"the global sync, current barrier epoch is $barrierEpoch.")
+
+val startTime = System.currentTimeMillis()
+val timerTask = new TimerTask {
+  override def run(): Unit = {
+logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) waiting " +
+  s"under the global sync since $startTime, has been waiting for " 
+
+  s"${(System.currentTimeMillis() - startTime) / 1000} seconds, 
current barrier epoch " +
+  s"is $barrierEpoch.")
+  }
+}
+// Log the update of global sync every 60 seconds.
+timer.schedule(timerTask, 6, 6)
+
+try {
+  barrierCoordinator.askSync[Unit](
+message = RequestToSync(numTasks, stageId, stageAttemptNumber, 
taskAttemptId, barrierEpoch),
+timeout = new RpcTimeout(31536000 /** = 3600 * 24 * 365 */ 
seconds, "barrierTimeout"))
+  barrierEpoch += 1
+  logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) finished " +
+"global sync successfully, waited for " +
+s"${(System.currentTimeMillis() - startTime) / 1000} seconds, 
current barrier epoch is " +
+s"$barrierEpoch.")
--- End diff --

Nice catch! just updated.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #93738 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93738/testReport)**
 for PR 21898 at commit 
[`766381d`](https://github.com/apache/spark/commit/766381d18174135d922a19172328268640a09f4c).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1453/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-07-28 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r205960887
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContextImpl.scala 
---
@@ -39,8 +44,51 @@ private[spark] class BarrierTaskContextImpl(
   taskMemoryManager, localProperties, metricsSystem, taskMetrics)
 with BarrierTaskContext {
 
-  // TODO SPARK-24817 implement global barrier.
-  override def barrier(): Unit = {}
+  private val barrierCoordinator: RpcEndpointRef = {
+val env = SparkEnv.get
+RpcUtils.makeDriverRef("barrierSync", env.conf, env.rpcEnv)
+  }
+
+  private val timer = new Timer("Barrier task timer for barrier() calls.")
+
+  private var barrierEpoch = 0
+
+  private lazy val numTasks = localProperties.getProperty("numTasks", 
"0").toInt
+
+  override def barrier(): Unit = {
+logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) has entered " +
+  s"the global sync, current barrier epoch is $barrierEpoch.")
+
+val startTime = System.currentTimeMillis()
+val timerTask = new TimerTask {
+  override def run(): Unit = {
+logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) waiting " +
+  s"under the global sync since $startTime, has been waiting for " 
+
+  s"${(System.currentTimeMillis() - startTime) / 1000} seconds, 
current barrier epoch " +
+  s"is $barrierEpoch.")
+  }
+}
+// Log the update of global sync every 60 seconds.
+timer.schedule(timerTask, 6, 6)
+
+try {
+  barrierCoordinator.askSync[Unit](
+message = RequestToSync(numTasks, stageId, stageAttemptNumber, 
taskAttemptId, barrierEpoch),
+timeout = new RpcTimeout(31536000 /** = 3600 * 24 * 365 */ 
seconds, "barrierTimeout"))
--- End diff --

I set a fix timeout for RPC intentionally, so users shall get a 
SparkException thrown by BarrierCoordinator, instead of RPCTimeoutException 
from the RPC framework.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-07-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r205960624
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContextImpl.scala 
---
@@ -39,8 +44,51 @@ private[spark] class BarrierTaskContextImpl(
   taskMemoryManager, localProperties, metricsSystem, taskMetrics)
 with BarrierTaskContext {
 
-  // TODO SPARK-24817 implement global barrier.
-  override def barrier(): Unit = {}
+  private val barrierCoordinator: RpcEndpointRef = {
+val env = SparkEnv.get
+RpcUtils.makeDriverRef("barrierSync", env.conf, env.rpcEnv)
+  }
+
+  private val timer = new Timer("Barrier task timer for barrier() calls.")
+
+  private var barrierEpoch = 0
+
+  private lazy val numTasks = localProperties.getProperty("numTasks", 
"0").toInt
+
+  override def barrier(): Unit = {
+logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) has entered " +
+  s"the global sync, current barrier epoch is $barrierEpoch.")
+
+val startTime = System.currentTimeMillis()
+val timerTask = new TimerTask {
+  override def run(): Unit = {
+logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) waiting " +
+  s"under the global sync since $startTime, has been waiting for " 
+
+  s"${(System.currentTimeMillis() - startTime) / 1000} seconds, 
current barrier epoch " +
+  s"is $barrierEpoch.")
+  }
+}
+// Log the update of global sync every 60 seconds.
+timer.schedule(timerTask, 6, 6)
+
+try {
+  barrierCoordinator.askSync[Unit](
+message = RequestToSync(numTasks, stageId, stageAttemptNumber, 
taskAttemptId, barrierEpoch),
+timeout = new RpcTimeout(31536000 /** = 3600 * 24 * 365 */ 
seconds, "barrierTimeout"))
--- End diff --

Use `BARRIER_SYNC_TIMEOUT` here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-07-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r205960662
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContextImpl.scala 
---
@@ -39,8 +44,51 @@ private[spark] class BarrierTaskContextImpl(
   taskMemoryManager, localProperties, metricsSystem, taskMetrics)
 with BarrierTaskContext {
 
-  // TODO SPARK-24817 implement global barrier.
-  override def barrier(): Unit = {}
+  private val barrierCoordinator: RpcEndpointRef = {
+val env = SparkEnv.get
+RpcUtils.makeDriverRef("barrierSync", env.conf, env.rpcEnv)
+  }
+
+  private val timer = new Timer("Barrier task timer for barrier() calls.")
+
+  private var barrierEpoch = 0
+
+  private lazy val numTasks = localProperties.getProperty("numTasks", 
"0").toInt
+
+  override def barrier(): Unit = {
+logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) has entered " +
+  s"the global sync, current barrier epoch is $barrierEpoch.")
+
+val startTime = System.currentTimeMillis()
+val timerTask = new TimerTask {
+  override def run(): Unit = {
+logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) waiting " +
+  s"under the global sync since $startTime, has been waiting for " 
+
+  s"${(System.currentTimeMillis() - startTime) / 1000} seconds, 
current barrier epoch " +
+  s"is $barrierEpoch.")
+  }
+}
+// Log the update of global sync every 60 seconds.
+timer.schedule(timerTask, 6, 6)
+
+try {
+  barrierCoordinator.askSync[Unit](
+message = RequestToSync(numTasks, stageId, stageAttemptNumber, 
taskAttemptId, barrierEpoch),
+timeout = new RpcTimeout(31536000 /** = 3600 * 24 * 365 */ 
seconds, "barrierTimeout"))
+  barrierEpoch += 1
+  logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt 
$stageAttemptNumber) finished " +
+"global sync successfully, waited for " +
+s"${(System.currentTimeMillis() - startTime) / 1000} seconds, 
current barrier epoch is " +
+s"$barrierEpoch.")
--- End diff --

Shall we stop `timer` for this epoch here if the global sync finished 
successfully?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-07-28 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r205960688
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContext.scala ---
@@ -27,6 +27,33 @@ trait BarrierTaskContext extends TaskContext {
* Sets a global barrier and waits until all tasks in this stage hit 
this barrier. Similar to
* MPI_Barrier function in MPI, the barrier() function call blocks until 
all tasks in the same
* stage have reached this routine.
+   *
+   * This function is expected to be called by EVERY tasks in the same 
barrier stage in the SAME
+   * pattern, otherwise you may get a SparkException. Some examples of 
misuses listed below:
+   * 1. Only call barrier() function on a subset of all the tasks in the 
same barrier stage, it
+   * shall lead to time out of the function call.
+   * rdd.barrier().mapPartitions { (iter, context) =>
+   * if (context.partitionId() == 0) {
+   * // Do nothing.
+   * } else {
+   * context.barrier()
+   * }
+   * iter
+   * }
+   *
+   * 2. Include barrier() function in a try-catch code block, this may 
lead to mismatched call of
+   * barrier function exception.
+   * rdd.barrier().mapPartitions { (iter, context) =>
+   * try {
+   * // Do something that might throw an Exception.
+   * doSomething()
+   * context.barrier()
+   * } catch {
+   * case e: Exception => logWarning("...", e)
+   * }
+   * context.barrier()
--- End diff --

This is to demonstrate that in one task there is only one call of 
`barrier()` while in others there may be two calls of `barrier()`. Please refer 
to `BarrierTaskContextSuite`.`"throw exception if barrier() call mismatched"`. 
However, I'm still considering what shall be the most proper behavior for this 
scenario.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-07-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r205960497
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContext.scala ---
@@ -27,6 +27,33 @@ trait BarrierTaskContext extends TaskContext {
* Sets a global barrier and waits until all tasks in this stage hit 
this barrier. Similar to
* MPI_Barrier function in MPI, the barrier() function call blocks until 
all tasks in the same
* stage have reached this routine.
+   *
+   * This function is expected to be called by EVERY tasks in the same 
barrier stage in the SAME
+   * pattern, otherwise you may get a SparkException. Some examples of 
misuses listed below:
+   * 1. Only call barrier() function on a subset of all the tasks in the 
same barrier stage, it
+   * shall lead to time out of the function call.
+   * rdd.barrier().mapPartitions { (iter, context) =>
+   * if (context.partitionId() == 0) {
+   * // Do nothing.
+   * } else {
+   * context.barrier()
+   * }
+   * iter
+   * }
+   *
+   * 2. Include barrier() function in a try-catch code block, this may 
lead to mismatched call of
+   * barrier function exception.
+   * rdd.barrier().mapPartitions { (iter, context) =>
+   * try {
+   * // Do something that might throw an Exception.
+   * doSomething()
+   * context.barrier()
+   * } catch {
+   * case e: Exception => logWarning("...", e)
+   * }
+   * context.barrier()
--- End diff --

I guess here should not be another `context.barrier()`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1452/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #93737 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93737/testReport)**
 for PR 21898 at commit 
[`01c274b`](https://github.com/apache/spark/commit/01c274b40dfdd601b75bb5c19261fbc732c7763b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21895
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93735/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21895
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21895
  
**[Test build #93735 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93735/testReport)**
 for PR 21895 at commit 
[`2ad5285`](https://github.com/apache/spark/commit/2ad52858ba9d17e26e6d7ea11658515af7a02a03).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-28 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21102#discussion_r205959224
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -3968,3 +3964,234 @@ object ArrayUnion {
 new GenericArrayData(arrayBuffer)
   }
 }
+
+/**
+ * Returns an array of the elements in the intersect of x and y, without 
duplicates
+ */
+@ExpressionDescription(
+  usage = """
+  _FUNC_(array1, array2) - Returns an array of the elements in the 
intersection of array1 and
+array2, without duplicates.
+  """,
+  examples = """
+Examples:Fun
+  > SELECT _FUNC_(array(1, 2, 3), array(1, 3, 5));
+   array(1, 3)
+  """,
+  since = "2.4.0")
+case class ArrayIntersect(left: Expression, right: Expression) extends 
ArraySetLike {
+  override def dataType: DataType = ArrayType(elementType,
+left.dataType.asInstanceOf[ArrayType].containsNull &&
+  right.dataType.asInstanceOf[ArrayType].containsNull)
+
+  @transient lazy val evalIntersect: (ArrayData, ArrayData) => ArrayData = 
{
+if (elementTypeSupportEquals) {
+  (array1, array2) =>
+val hs = new OpenHashSet[Any]
+val hsResult = new OpenHashSet[Any]
+var foundNullElement = false
+var i = 0
+while (i < array2.numElements()) {
+  if (array2.isNullAt(i)) {
+foundNullElement = true
+  } else {
+val elem = array2.get(i, elementType)
+hs.add(elem)
+  }
+  i += 1
+}
+val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+i = 0
+while (i < array1.numElements()) {
+  if (array1.isNullAt(i)) {
+if (foundNullElement) {
+  arrayBuffer += null
+  foundNullElement = false
+}
+  } else {
+val elem = array1.get(i, elementType)
+if (hs.contains(elem) && !hsResult.contains(elem)) {
+  arrayBuffer += elem
+  hsResult.add(elem)
+}
+  }
+  i += 1
+}
+new GenericArrayData(arrayBuffer)
+} else {
+  (array1, array2) =>
+val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+var alreadySeenNull = false
+var i = 0
+while (i < array1.numElements()) {
+  var found = false
+  val elem1 = array1.get(i, elementType)
+  if (array1.isNullAt(i)) {
+if (!alreadySeenNull) {
+  var j = 0
+  while (!found && j < array2.numElements()) {
+found = array2.isNullAt(j)
+j += 1
+  }
+  // array2 is scanned only once for null element
+  alreadySeenNull = true
+}
+  } else {
+var j = 0
+while (!found && j < array2.numElements()) {
+  if (!array2.isNullAt(j)) {
+val elem2 = array2.get(j, elementType)
+if (ordering.equiv(elem1, elem2)) {
+  // check whether elem1 is already stored in arrayBuffer
+  var foundArrayBuffer = false
+  var k = 0
+  while (!foundArrayBuffer && k < arrayBuffer.size) {
+val va = arrayBuffer(k)
+foundArrayBuffer = (va != null) && ordering.equiv(va, 
elem1)
+k += 1
+  }
+  found = !foundArrayBuffer
+}
+  }
+  j += 1
+}
+  }
+  if (found) {
+arrayBuffer += elem1
+  }
+  i += 1
+}
+new GenericArrayData(arrayBuffer)
+}
+  }
+
+  override def nullSafeEval(input1: Any, input2: Any): Any = {
+val array1 = input1.asInstanceOf[ArrayData]
+val array2 = input2.asInstanceOf[ArrayData]
+
+evalIntersect(array1, array2)
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+val arrayData = classOf[ArrayData].getName
+val i = ctx.freshName("i")
--- End diff --

It would be good to refactor as a method from L4077 to L4124 since this 
part can be used among `union`, `except`, and `intersect`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21904: [SPARK-24953] [SQL] Prune a branch in `CaseWhen` ...

2018-07-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21904#discussion_r205958402
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -416,6 +416,29 @@ object SimplifyConditionals extends Rule[LogicalPlan] 
with PredicateHelper {
 // these branches can be pruned away
 val (h, t) = branches.span(_._1 != TrueLiteral)
 CaseWhen( h :+ t.head, None)
+
+  case e @ CaseWhen(branches, _) =>
+val newBranches = branches.foldLeft(List[(Expression, 
Expression)]()) {
+  case (newBranches, branch) =>
+if (newBranches.exists(_._1.semanticEquals(branch._1))) {
+  // If a condition in a branch is previously seen, this 
branch can be pruned.
+  // TODO: In fact, if a condition is a sub-condition of the 
previous one,
+  // TODO: it can be pruned. This is less strict and can be 
implemented
+  // TODO: by decomposing seen conditions.
+  newBranches
--- End diff --

This seems good as the branch is useless. Removing it should simplify code 
and query plan.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21904: [SPARK-24953] [SQL] Prune a branch in `CaseWhen` ...

2018-07-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21904#discussion_r205958393
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -416,6 +416,29 @@ object SimplifyConditionals extends Rule[LogicalPlan] 
with PredicateHelper {
 // these branches can be pruned away
 val (h, t) = branches.span(_._1 != TrueLiteral)
 CaseWhen( h :+ t.head, None)
+
+  case e @ CaseWhen(branches, _) =>
+val newBranches = branches.foldLeft(List[(Expression, 
Expression)]()) {
+  case (newBranches, branch) =>
+if (newBranches.exists(_._1.semanticEquals(branch._1))) {
+  // If a condition in a branch is previously seen, this 
branch can be pruned.
+  // TODO: In fact, if a condition is a sub-condition of the 
previous one,
+  // TODO: it can be pruned. This is less strict and can be 
implemented
+  // TODO: by decomposing seen conditions.
+  newBranches
+} else if (newBranches.nonEmpty && 
newBranches.last._2.semanticEquals(branch._2)) {
+  // If the outputs of two adjacent branches are the same, two 
branches can be combined.
+  newBranches.take(newBranches.length - 1)
+.:+((Or(newBranches.last._1, branch._1), 
newBranches.last._2))
--- End diff --

Can this provide any benefit?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21488: [SPARK-18057][SS] Update Kafka client version from 0.10....

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1451/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21488: [SPARK-18057][SS] Update Kafka client version from 0.10....

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21488: [SPARK-18057][SS] Update Kafka client version from 0.10....

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21488
  
**[Test build #93736 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93736/testReport)**
 for PR 21488 at commit 
[`aa69915`](https://github.com/apache/spark/commit/aa69915165d9aaca2bcb5d22fb2fc9467bf16826).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21439
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21439
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93734/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21439
  
**[Test build #93734 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93734/testReport)**
 for PR 21439 at commit 
[`021350b`](https://github.com/apache/spark/commit/021350b29c11c19c5a1b8343553c4a531abe39a2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongToUnsafeRowMap in ex...

2018-07-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21772
  
LGTM too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21608
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21608
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93731/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21608
  
**[Test build #93731 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93731/testReport)**
 for PR 21608 at commit 
[`e727b69`](https://github.com/apache/spark/commit/e727b693e2814ef7f8ce9622bb5bff9714cdad31).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21909
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93732/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21909
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21909
  
**[Test build #93732 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93732/testReport)**
 for PR 21909 at commit 
[`359c4fc`](https://github.com/apache/spark/commit/359c4fcbfdb4f4e77faa3977f381dc8e819e46fa).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSui...

2018-07-28 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21908
  
good catch, LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21699
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21699
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93733/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21699
  
**[Test build #93733 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93733/testReport)**
 for PR 21699 at commit 
[`e76e7ad`](https://github.com/apache/spark/commit/e76e7adcca6787cb334b19f8db35f3a4ec61bafc).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21895
  
**[Test build #93735 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93735/testReport)**
 for PR 21895 at commit 
[`2ad5285`](https://github.com/apache/spark/commit/2ad52858ba9d17e26e6d7ea11658515af7a02a03).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21895
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21895
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1450/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21899: [SPARK-24912][SQL] Don't obscure source of OOM du...

2018-07-28 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21899#discussion_r205954659
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala
 ---
@@ -118,12 +119,19 @@ case class BroadcastExchangeExec(
   // SparkFatalException, which is a subclass of Exception. 
ThreadUtils.awaitResult
   // will catch this exception and re-throw the wrapped fatal 
throwable.
   case oe: OutOfMemoryError =>
-throw new SparkFatalException(
+val sizeMessage = if (dataSize != -1) {
+  s"; Size of table is $dataSize"
--- End diff --

You are taking into account only size of the `relation`.  Probably, result 
of `child.executeCollectIterator()` consumes some memory too, and an user will 
need more memory than `dataSize`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93730/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #93730 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93730/testReport)**
 for PR 21898 at commit 
[`5c5db85`](https://github.com/apache/spark/commit/5c5db85723e10a1c507c83ec2e370996202a77fe).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSui...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21908
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93727/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSui...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21908
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSui...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21908
  
**[Test build #93727 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93727/testReport)**
 for PR 21908 at commit 
[`0bb6a5b`](https://github.com/apache/spark/commit/0bb6a5b29beb987fdaac2d595cfd63fbd752af65).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21439
  
**[Test build #93734 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93734/testReport)**
 for PR 21439 at commit 
[`021350b`](https://github.com/apache/spark/commit/021350b29c11c19c5a1b8343553c4a531abe39a2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21439: [SPARK-24391][SQL] Support arrays of any types by...

2018-07-28 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21439#discussion_r205953494
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 ---
@@ -101,6 +102,17 @@ class JacksonParser(
 }
   }
 
+  private def makeArrayRootConverter(at: ArrayType): JsonParser => 
Seq[InternalRow] = {
+val elemConverter = makeConverter(at.elementType)
+(parser: JsonParser) => parseJsonToken[Seq[InternalRow]](parser, at) {
+  case START_ARRAY => Seq(InternalRow(convertArray(parser, 
elemConverter)))
+  case START_OBJECT if at.elementType.isInstanceOf[StructType] =>
--- End diff --

I am adding a comment for this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21895
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93726/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21895
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21895
  
**[Test build #93726 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93726/testReport)**
 for PR 21895 at commit 
[`480e326`](https://github.com/apache/spark/commit/480e3268f51dbd830f5ee5013efd2315f3aff3a9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21909
  
**[Test build #93732 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93732/testReport)**
 for PR 21909 at commit 
[`359c4fc`](https://github.com/apache/spark/commit/359c4fcbfdb4f4e77faa3977f381dc8e819e46fa).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21699
  
**[Test build #93733 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93733/testReport)**
 for PR 21699 at commit 
[`e76e7ad`](https://github.com/apache/spark/commit/e76e7adcca6787cb334b19f8db35f3a4ec61bafc).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21884: [SPARK-24960][Kubernetes] explicitly expose ports on dri...

2018-07-28 Thread adelbertc
Github user adelbertc commented on the issue:

https://github.com/apache/spark/pull/21884
  
@mccheah Done https://issues.apache.org/jira/browse/SPARK-24960

Re: Exposing ports, my understanding of Kubernetes' networking model is it 
is possible for a cluster to be setup such that Pod ports are closed, perhaps 
as a security measure. At least in our Kubernetes cluster (tested both on 1.6.x 
and 1.8.x) the SparkPi example failed with a connection error without this 
change, and succeeded after. (I've also added this information to the JIRA)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/21909
  
jenkins, retest this, please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21608
  
**[Test build #93731 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93731/testReport)**
 for PR 21608 at commit 
[`e727b69`](https://github.com/apache/spark/commit/e727b693e2814ef7f8ce9622bb5bff9714cdad31).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21909
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21909
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93729/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21909
  
**[Test build #93729 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93729/testReport)**
 for PR 21909 at commit 
[`359c4fc`](https://github.com/apache/spark/commit/359c4fcbfdb4f4e77faa3977f381dc8e819e46fa).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-07-28 Thread mstreuhofer
Github user mstreuhofer commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r205951170
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: 
Boolean)
+  extends Logging {
+
+  private val virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private val virtualEnvBinPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private val initPythonPackages = 
conf.getOption("spark.pyspark.virtualenv.packages")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
+this(pythonExec, new SparkConf().setAll(properties.asScala), isDriver)
+this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   *
+   */
+  def setupVirtualEnv(): String = {
+/*
+ *
+ * Native Virtualenv:
+ *   -  Execute command: virtualenv -p  --no-site-packages 

+ *   -  Execute command: python -m pip --cache-dir  install 
-r 
+ *
+ * Conda
+ *   -  Execute command: conda create --prefix  --file 
 -y
+ *
+ */
+logInfo("Start to setup virtualenv...")
+logDebug("user.dir=" + System.getProperty("user.dir"))
+logDebug("user.home=" + System.getProperty("user.home"))
+
+require(virtualEnvType == "native" || virtualEnvType == "conda",
+  s"VirtualEnvType: $virtualEnvType is not supported." )
+require(new File(virtualEnvBinPath).exists(),
+  s"VirtualEnvBinPath: $virtualEnvBinPath is not defined or doesn't 
exist.")
+// Two scenarios of creating virtualenv:
+// 1. created in yarn container. Yarn will clean it up after container 
is exited
+// 2. created outside yarn container. Spark need to create temp 
directory and clean it after app
+//finish.
+//  - driver of PySpark shell
+//  - driver of yarn-client mode
+if (isLauncher ||
+  (isDriver && conf.get("spark.submit.deployMode") == "client")) {
+  val virtualenvBasedir = Files.createTempDir()
+  virtualenvBasedir.deleteOnExit()
+  virtualEnvName = virtualenvBasedir.getAbsolutePath
+} else if (isDriver && conf.get("spark.submit.deployMode") == 
"cluster") {
+  virtualEnvName = "virtualenv_driver"
+} else {
+  // use the working directory of Executor
+  virtualEnvName = "virtualenv_" + conf.getAppId + "_" + 
VIRTUALENV_ID.getAndIncrement()
+}
+
+// Use the absolute path of requirement file in the following cases
+// 1. driver of pyspark shell
+// 2. driver of yarn-client mode
+// otherwise just use filename as it would be downloaded to the 
working directory of Executor
+val pysparkRequirements =
+  if (isLauncher ||
+(isDriver && conf.get("spark.submit.deployMode") == "client")) {
+conf.getOption("spark.pyspark.virtualenv.requirements")
+  } else {
+
conf.getOption("spark.pyspark.virtualenv.requirements").map(_.split("/").last)
+  }
+
+val 

[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-07-28 Thread mstreuhofer
Github user mstreuhofer commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r205950127
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: 
Boolean)
+  extends Logging {
+
+  private val virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private val virtualEnvBinPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private val initPythonPackages = 
conf.getOption("spark.pyspark.virtualenv.packages")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
+this(pythonExec, new SparkConf().setAll(properties.asScala), isDriver)
+this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   *
+   */
+  def setupVirtualEnv(): String = {
+/*
+ *
+ * Native Virtualenv:
+ *   -  Execute command: virtualenv -p  --no-site-packages 

+ *   -  Execute command: python -m pip --cache-dir  install 
-r 
+ *
+ * Conda
+ *   -  Execute command: conda create --prefix  --file 
 -y
+ *
+ */
+logInfo("Start to setup virtualenv...")
+logDebug("user.dir=" + System.getProperty("user.dir"))
+logDebug("user.home=" + System.getProperty("user.home"))
+
+require(virtualEnvType == "native" || virtualEnvType == "conda",
+  s"VirtualEnvType: $virtualEnvType is not supported." )
+require(new File(virtualEnvBinPath).exists(),
+  s"VirtualEnvBinPath: $virtualEnvBinPath is not defined or doesn't 
exist.")
+// Two scenarios of creating virtualenv:
+// 1. created in yarn container. Yarn will clean it up after container 
is exited
+// 2. created outside yarn container. Spark need to create temp 
directory and clean it after app
+//finish.
+//  - driver of PySpark shell
+//  - driver of yarn-client mode
+if (isLauncher ||
+  (isDriver && conf.get("spark.submit.deployMode") == "client")) {
+  val virtualenvBasedir = Files.createTempDir()
+  virtualenvBasedir.deleteOnExit()
--- End diff --

the temporary directory is not being deleted on exit.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/21895
  
+CC @jerryshao 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-07-28 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/21895#discussion_r205948923
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -973,6 +973,38 @@ private[history] object FsHistoryProvider {
   private[history] val CURRENT_LISTING_VERSION = 1L
 }
 
+private[history] trait CachedFileSystemHelper extends Logging {
+  protected def fs: FileSystem
+
+  /**
+   * Cache containing the result for the already checked files.
+   */
+  // Visible for testing.
+  private[history] val cache = new mutable.HashMap[String, Boolean]
--- End diff --

For long running history server in busy clusters (particularly where 
`spark.history.fs.cleaner.maxAge` is configured to be low), this Map will cause 
OOM.
Either an LRU cache or a disk backed map with periodic cleanup (based on 
maxAge) might be better ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #93730 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93730/testReport)**
 for PR 21898 at commit 
[`5c5db85`](https://github.com/apache/spark/commit/5c5db85723e10a1c507c83ec2e370996202a77fe).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1449/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-07-28 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21898
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19659: [SPARK-19668][ML] Multiple NGram sizes

2018-07-28 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19659
  
@holdenk I can take this up if this is needed. Let me know. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21847: [SPARK-24855][SQL][EXTERNAL]: Built-in AVRO support shou...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21847
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93728/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21847: [SPARK-24855][SQL][EXTERNAL]: Built-in AVRO support shou...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21847
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21847: [SPARK-24855][SQL][EXTERNAL]: Built-in AVRO support shou...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21847
  
**[Test build #93728 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93728/testReport)**
 for PR 21847 at commit 
[`183b684`](https://github.com/apache/spark/commit/183b6841a8a121bb07d9fe553ca213eda78eb35a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21904: [SPARK-24953] [SQL] Prune a branch in `CaseWhen` ...

2018-07-28 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21904#discussion_r205947535
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -416,6 +416,29 @@ object SimplifyConditionals extends Rule[LogicalPlan] 
with PredicateHelper {
 // these branches can be pruned away
 val (h, t) = branches.span(_._1 != TrueLiteral)
 CaseWhen( h :+ t.head, None)
+
+  case e @ CaseWhen(branches, _) =>
+val newBranches = branches.foldLeft(List[(Expression, 
Expression)]()) {
+  case (newBranches, branch) =>
+if (newBranches.exists(_._1.semanticEquals(branch._1))) {
+  // If a condition in a branch is previously seen, this 
branch can be pruned.
+  // TODO: In fact, if a condition is a sub-condition of the 
previous one,
+  // TODO: it can be pruned. This is less strict and can be 
implemented
+  // TODO: by decomposing seen conditions.
+  newBranches
+} else if (newBranches.nonEmpty && 
newBranches.last._2.semanticEquals(branch._2)) {
+  // If the outputs of two adjacent branches are the same, two 
branches can be combined.
+  newBranches.take(newBranches.length - 1)
--- End diff --

nit: `newBranches.init`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21904: [SPARK-24953] [SQL] Prune a branch in `CaseWhen` ...

2018-07-28 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21904#discussion_r205947481
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -416,6 +416,29 @@ object SimplifyConditionals extends Rule[LogicalPlan] 
with PredicateHelper {
 // these branches can be pruned away
 val (h, t) = branches.span(_._1 != TrueLiteral)
 CaseWhen( h :+ t.head, None)
+
+  case e @ CaseWhen(branches, _) =>
+val newBranches = branches.foldLeft(List[(Expression, 
Expression)]()) {
--- End diff --

what about using `ArrayBuffer`? So we don't have to recreate the List at 
every iteration...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21909
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21909
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21909
  
**[Test build #93729 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93729/testReport)**
 for PR 21909 at commit 
[`359c4fc`](https://github.com/apache/spark/commit/359c4fcbfdb4f4e77faa3977f381dc8e819e46fa).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-07-28 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/21909

[SPARK-24959][SQL] Speed up count() for JSON and CSV

## What changes were proposed in this pull request?

In the PR, I propose to skip invoking of the CSV/JSON parser per each line 
in the case if the required schema is empty. Added benchmarks for `count()` 
shows performance improvement up to **3.5 times**.

Before:

```
Count a dataset with 10 columns:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)

--
JSON count()   7676 / 7715  1.3 
767.6
CSV count()3309 / 3363  3.0 
330.9
``` 

After:

```
Count a dataset with 10 columns:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)

--
JSON count()   2104 / 2156  4.8 
210.4
CSV count()2332 / 2386  4.3 
233.2
```

## How was this patch tested?

It was tested by `CSVSuite` and `JSONSuite` as well as on added benchmarks.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 empty-schema-optimization

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21909.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21909


commit bc4ce261a2d13be0a31b18f006da79b55880d409
Author: Maxim Gekk 
Date:   2018-07-28T15:31:20Z

Added a benchmark for count()

commit 91250d21d4bb451062873c59df6fe3b4669bc5ff
Author: Maxim Gekk 
Date:   2018-07-28T15:50:15Z

Added a CSV benchmark for count()

commit bdc5ea540b9eb62bb28606bdeb311ce5662e4bf7
Author: Maxim Gekk 
Date:   2018-07-28T15:59:44Z

Speed up count()

commit d40f9bb229ab8ea9e2d95499ae203f7c41098bcd
Author: Maxim Gekk 
Date:   2018-07-28T16:00:17Z

Updating CSV and JSON benchmarks for count()

commit abd8572497ff742ef6ea942864195be75a40ca71
Author: Maxim Gekk 
Date:   2018-07-28T16:23:03Z

Fix benchmark's output

commit 359c4fcbfdb4f4e77faa3977f381dc8e819e46fa
Author: Maxim Gekk 
Date:   2018-07-28T16:23:44Z

Uncomment other benchmarks




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-07-28 Thread holdensmagicalunicorn
Github user holdensmagicalunicorn commented on the issue:

https://github.com/apache/spark/pull/21909
  
@MaxGekk, thanks! I am a bot who has found some folks who might be able to 
help with the review:@HyukjinKwon, @gatorsmile and @cloud-fan


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21847: [SPARK-24855][SQL][EXTERNAL]: Built-in AVRO support shou...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21847
  
**[Test build #93728 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93728/testReport)**
 for PR 21847 at commit 
[`183b684`](https://github.com/apache/spark/commit/183b6841a8a121bb07d9fe553ca213eda78eb35a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSui...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21908
  
**[Test build #93727 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93727/testReport)**
 for PR 21908 at commit 
[`0bb6a5b`](https://github.com/apache/spark/commit/0bb6a5b29beb987fdaac2d595cfd63fbd752af65).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSui...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21908
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1448/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSui...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21908
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSui...

2018-07-28 Thread holdensmagicalunicorn
Github user holdensmagicalunicorn commented on the issue:

https://github.com/apache/spark/pull/21908
  
@jiangxb1987, thanks! I am a bot who has found some folks who might be able 
to help with the review:@squito, @kayousterhout and @mateiz


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21908: [MINOR][CORE][TEST] Fix afterEach() in TastSetMan...

2018-07-28 Thread jiangxb1987
GitHub user jiangxb1987 opened a pull request:

https://github.com/apache/spark/pull/21908

[MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSuite and 
TaskSchedulerImplSuite

## What changes were proposed in this pull request?

In the `afterEach()` method of both `TastSetManagerSuite` and 
`TaskSchedulerImplSuite`, `super.afterEach()` shall be called at the end, 
because it shall stop the SparkContext.


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
The test failure is caused by the above reason, the newly added 
`barrierCoordinator` required `rpcEnv` which has been stopped before 
`TaskSchedulerImpl` doing cleanup.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jiangxb1987/spark afterEach

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21908.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21908


commit 0bb6a5b29beb987fdaac2d595cfd63fbd752af65
Author: Xingbo Jiang 
Date:   2018-07-28T16:09:07Z

fix afterEach




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21852: [SPARK-24893] [SQL] Remove the entire CaseWhen if...

2018-07-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/21852#discussion_r205946975
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -416,6 +416,23 @@ object SimplifyConditionals extends Rule[LogicalPlan] 
with PredicateHelper {
 // these branches can be pruned away
 val (h, t) = branches.span(_._1 != TrueLiteral)
 CaseWhen( h :+ t.head, None)
+
+  case e @ CaseWhen(branches, Some(elseValue)) if {
--- End diff --

We can not. When no `elseValue`, all the conditions are required to 
evaluated before hitting the default `elseValue` which is `null`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21901: [SPARK-24950][SQL] DateTimeUtilsSuite daysToMilli...

2018-07-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21901


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21895
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21895
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1447/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21901: [SPARK-24950][SQL] DateTimeUtilsSuite daysToMillis and m...

2018-07-28 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21901
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21895
  
cc @mridulm too


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21895
  
**[Test build #93726 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93726/testReport)**
 for PR 21895 at commit 
[`480e326`](https://github.com/apache/spark/commit/480e3268f51dbd830f5ee5013efd2315f3aff3a9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21886: [SPARK-21274][SQL] Implement INTERSECT ALL clause

2018-07-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21886
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93725/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   >