[GitHub] spark issue #21583: [SPARK-23984][K8S][Test] Added Integration Tests for PyS...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21583
  
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/385/



---




[GitHub] spark issue #21508: [SPARK-24488] [SQL] Fix issue when generator is aliased ...

2018-06-21 Thread bkrieger
Github user bkrieger commented on the issue:

https://github.com/apache/spark/pull/21508
  
@gatorsmile @hvanhovell can you take another look at this?


---




[GitHub] spark issue #21583: [SPARK-23984][K8S][Test] Added Integration Tests for PyS...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21583
  
**[Test build #92182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92182/testReport)** for PR 21583 at commit [`2707dee`](https://github.com/apache/spark/commit/2707dee967fd6c4cebbe96cc7ae40feb5bfced24).


---




[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-21 Thread henryr
Github user henryr commented on the issue:

https://github.com/apache/spark/pull/21482
  
Any further comments here?


---




[GitHub] spark issue #21583: [SPARK-23984][K8S][Test] Added Integration Tests for PyS...

2018-06-21 Thread ifilonenko
Github user ifilonenko commented on the issue:

https://github.com/apache/spark/pull/21583
  
retest this please


---




[GitHub] spark pull request #21606: [SPARK-24552][core][SQL] Use task ID instead of a...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21606#discussion_r197263827
  
--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -104,12 +104,12 @@ object SparkHadoopWriter extends Logging {
   jobTrackerId: String,
   commitJobId: Int,
   sparkPartitionId: Int,
-  sparkAttemptNumber: Int,
+  sparkTaskId: Long,
   committer: FileCommitProtocol,
   iterator: Iterator[(K, V)]): TaskCommitMessage = {
 // Set up a task.
 val taskContext = config.createTaskAttemptContext(
-  jobTrackerId, commitJobId, sparkPartitionId, sparkAttemptNumber)
+  jobTrackerId, commitJobId, sparkPartitionId, sparkTaskId.toInt)
--- End diff --

is it safe?
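
A minimal sketch of the concern, relying only on Scala's narrowing semantics (names are hypothetical, not from the PR): `Long.toInt` keeps just the low 32 bits, so distinct task ids can collide or turn negative after the conversion.

```scala
// Hypothetical demo of Long-to-Int narrowing, not code from the PR.
object ToIntTruncation extends App {
  val smallTaskId = 5L
  val largeTaskId = (1L << 32) + 5L   // a task id past Int.MaxValue
  println(smallTaskId.toInt)          // 5
  println(largeTaskId.toInt)          // also 5 -- the two ids collide
  println(Long.MaxValue.toInt)        // -1 -- the sign can flip too
}
```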


---




[GitHub] spark pull request #21606: [SPARK-24552][core][SQL] Use task ID instead of a...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21606#discussion_r197265481
  
--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -104,12 +104,12 @@ object SparkHadoopWriter extends Logging {
   jobTrackerId: String,
   commitJobId: Int,
   sparkPartitionId: Int,
-  sparkAttemptNumber: Int,
+  sparkTaskId: Long,
   committer: FileCommitProtocol,
   iterator: Iterator[(K, V)]): TaskCommitMessage = {
 // Set up a task.
 val taskContext = config.createTaskAttemptContext(
-  jobTrackerId, commitJobId, sparkPartitionId, sparkAttemptNumber)
+  jobTrackerId, commitJobId, sparkPartitionId, sparkTaskId.toInt)
--- End diff --

task id is unique across the entire Spark application, which means we may have very large task ids in a long-running micro-batch streaming application.

If we do need an int here, I'd suggest we combine `stageAttemptNumber` and `taskAttemptNumber` into an int, which is much less risky. (Spark won't have a lot of stage/task attempts.)
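
A sketch of the suggested combination, assuming a 16/16 bit split (the split and the function name are illustrative, not from the PR): both attempt counters stay small, so each fits comfortably in half an Int.

```scala
// Hypothetical packing of two small attempt counters into one Int.
def packAttemptNumber(stageAttemptNumber: Int, taskAttemptNumber: Int): Int = {
  require(stageAttemptNumber < (1 << 16) && taskAttemptNumber < (1 << 16))
  (stageAttemptNumber << 16) | taskAttemptNumber
}
```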


---




[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21606
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21606
  
**[Test build #92181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92181/testReport)** for PR 21606 at commit [`7233a5f`](https://github.com/apache/spark/commit/7233a5fd7b154e2a1400c5fac11d0356a22f5f98).


---




[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21606
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/384/
Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
See link above for the updated PR.


---




[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...

2018-06-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21606
  
Credit here should go to @rdblue when merging.


---




[GitHub] spark pull request #21606: [SPARK-24552][core][SQL] Use task ID instead of a...

2018-06-21 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/21606

[SPARK-24552][core][SQL] Use task ID instead of attempt number for writes.

This passes the unique task attempt id instead of attempt number to v2 data 
sources because attempt number is reused when stages are retried. When attempt 
numbers are reused, sources that track data by partition id and attempt number 
may incorrectly clean up data because the same attempt number can be both 
committed and aborted.
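
A small sketch of the distinction the description relies on (the usage is illustrative only): `TaskContext.attemptNumber()` restarts at 0 when a stage is retried, so `(partitionId, attemptNumber)` pairs can repeat across stage attempts, while `TaskContext.taskAttemptId()` is unique for the whole application and cannot be reused.

```scala
import org.apache.spark.TaskContext

// Identify a writer by partition id plus the application-unique task id,
// rather than by the per-stage attempt number that can repeat.
def writerIdentity(ctx: TaskContext): (Int, Long) =
  (ctx.partitionId(), ctx.taskAttemptId())
```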

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-24552.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21606.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21606


commit 6c60d1462c34f01610ada50c989832775b6fd117
Author: Ryan Blue 
Date:   2018-06-13T19:50:00Z

SPARK-24552: Use task ID instead of attempt number for v2 writes.

commit 2e6552460eed3013e649b06b16a1d14b1e542e2d
Author: Marcelo Vanzin 
Date:   2018-06-21T17:21:00Z

Rename attemptId -> taskId for clarity.

commit 3561723341c3062ba7d8682ea272c549b4bdc245
Author: Marcelo Vanzin 
Date:   2018-06-21T17:28:12Z

Use task ID instead of attempt for the Hadoop API too.

commit d5a079d439740f3067722d4e8c9e8e94f292017c
Author: Marcelo Vanzin 
Date:   2018-06-21T18:37:54Z

Merge branch 'master' into SPARK-24552.2

commit fdcd39c852e9a2d70da95c37da04190910e7b2f0
Author: Marcelo Vanzin 
Date:   2018-06-21T18:51:48Z

Log message update.

commit 7233a5fd7b154e2a1400c5fac11d0356a22f5f98
Author: Marcelo Vanzin 
Date:   2018-06-21T18:57:02Z

Javadoc updates.




---




[GitHub] spark issue #21282: [SPARK-23934][SQL] Adding map_from_entries function

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21282
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21282: [SPARK-23934][SQL] Adding map_from_entries function

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21282
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92175/
Test PASSed.


---




[GitHub] spark issue #21282: [SPARK-23934][SQL] Adding map_from_entries function

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21282
  
**[Test build #92175 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92175/testReport)** for PR 21282 at commit [`4eaedc5`](https://github.com/apache/spark/commit/4eaedc50f92a3dd1ee2100fbbd5ac951344ece75).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21542
  
**[Test build #92180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92180/testReport)** for PR 21542 at commit [`86ef42c`](https://github.com/apache/spark/commit/86ef42ceaa3bdc623fa0b01f3ea076cf0f63902a).


---




[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21542
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/383/
Test PASSed.


---




[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21542
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21605
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21605
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92176/
Test PASSed.


---




[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21605
  
**[Test build #92176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92176/testReport)** for PR 21605 at commit [`fb8cd54`](https://github.com/apache/spark/commit/fb8cd5466c39b69822db145d17a2e40255e74b03).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21594#discussion_r197247925
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -105,24 +105,58 @@ class CacheManager extends Logging {
   }
 
   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the 
given plan.
+   * @param query The [[Dataset]] to be un-cached.
+   * @param cascade   If true, un-cache all the cache entries that refer 
to the given
+   *  [[Dataset]]; otherwise un-cache the given 
[[Dataset]] only.
+   * @param blocking  Whether to block until all blocks are deleted.
*/
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = 
writeLock {
-uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
+uncacheQuery(query.sparkSession, query.logicalPlan, cascade, blocking)
   }
 
   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the 
given plan.
+   * @param spark The Spark session.
+   * @param plan  The plan to be un-cached.
+   * @param cascade   If true, un-cache all the cache entries that refer 
to the given
+   *  plan; otherwise un-cache the given plan only.
+   * @param blocking  Whether to block until all blocks are deleted.
*/
-  def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: 
Boolean): Unit = writeLock {
+  def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+cascade: Boolean, blocking: Boolean): Unit = writeLock {
+val condition: LogicalPlan => Boolean =
--- End diff --

`condition` -> `shouldUnCache`?


---




[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21594#discussion_r197247763
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -105,24 +105,58 @@ class CacheManager extends Logging {
   }
 
   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the 
given plan.
+   * @param query The [[Dataset]] to be un-cached.
+   * @param cascade   If true, un-cache all the cache entries that refer 
to the given
+   *  [[Dataset]]; otherwise un-cache the given 
[[Dataset]] only.
+   * @param blocking  Whether to block until all blocks are deleted.
*/
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = 
writeLock {
-uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
+uncacheQuery(query.sparkSession, query.logicalPlan, cascade, blocking)
   }
 
   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the 
given plan.
+   * @param spark The Spark session.
+   * @param plan  The plan to be un-cached.
+   * @param cascade   If true, un-cache all the cache entries that refer 
to the given
+   *  plan; otherwise un-cache the given plan only.
+   * @param blocking  Whether to block until all blocks are deleted.
*/
-  def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: 
Boolean): Unit = writeLock {
+  def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+cascade: Boolean, blocking: Boolean): Unit = writeLock {
--- End diff --

ditto


---




[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21594#discussion_r197247440
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2971,7 +2971,7 @@ class Dataset[T] private[sql](
* @since 1.6.0
*/
   def unpersist(blocking: Boolean): this.type = {
-sparkSession.sharedState.cacheManager.uncacheQuery(this, blocking)
+sparkSession.sharedState.cacheManager.uncacheQuery(this, false, 
blocking)
--- End diff --

nit: it's clearer to write `cascade = false`
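
To see the call in context, a small usage sketch grounded in the hunk quoted above (the `df` value is illustrative):

```scala
import org.apache.spark.sql.Dataset

// Per the Dataset.scala hunk above, unpersist() now routes through the
// non-cascading path: uncacheQuery(this, cascade = false, blocking).
// Entries that merely refer to df's plan stay cached.
def dropOwnCacheOnly(df: Dataset[_]): Unit = df.unpersist(blocking = false)
```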


---




[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21594#discussion_r197247632
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -105,24 +105,58 @@ class CacheManager extends Logging {
   }
 
   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the 
given plan.
+   * @param query The [[Dataset]] to be un-cached.
+   * @param cascade   If true, un-cache all the cache entries that refer 
to the given
+   *  [[Dataset]]; otherwise un-cache the given 
[[Dataset]] only.
+   * @param blocking  Whether to block until all blocks are deleted.
*/
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = 
writeLock {
-uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
--- End diff --

nit
```
def f(
    param1: X,
    param2: Y)
```
4-space indentation.


---




[GitHub] spark pull request #21599: [SPARK-24598][SQL] Overflow on arithmetic operati...

2018-06-21 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21599#discussion_r197246916
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -128,17 +128,31 @@ abstract class BinaryArithmetic extends 
BinaryOperator with NullIntolerant {
   def calendarIntervalMethod: String =
 sys.error("BinaryArithmetics must override either 
calendarIntervalMethod or genCode")
 
+  def checkOverflowCode(result: String, op1: String, op2: String): String =
+sys.error("BinaryArithmetics must override either checkOverflowCode or 
genCode")
+
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = 
dataType match {
 case _: DecimalType =>
   defineCodeGen(ctx, ev, (eval1, eval2) => 
s"$eval1.$decimalMethod($eval2)")
 case CalendarIntervalType =>
   defineCodeGen(ctx, ev, (eval1, eval2) => 
s"$eval1.$calendarIntervalMethod($eval2)")
+// In the following cases, overflow can happen, so we need to check 
the result is valid.
+// Otherwise we throw an ArithmeticException
--- End diff --

Personally, I am quite against returning null. It is not something a user expects, so they are likely not to check for it (when I see a NULL myself, I think one of the two operands was NULL, not that an overflow occurred); they won't realize the issue and will end up with corrupted data. Moreover, this is not how RDBMSs behave, and it is against the SQL standard. So I think the behavior that was chosen for DECIMAL was wrong, and I'd prefer not to introduce the same behavior in other places.

Anyway, I see your point about consistency across the codebase, and it makes sense.

I'd love to know @gatorsmile's and @hvanhovell's opinions too.
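
To make the ambiguity concrete, a hypothetical example (not from the PR):

```scala
import org.apache.spark.sql.SparkSession

// With null-on-overflow, both queries would return NULL, and the user could
// not tell the overflow apart from the genuinely NULL operand.
// (Today the second query silently wraps around instead.)
def showAmbiguity(spark: SparkSession): Unit = {
  spark.sql("SELECT CAST(NULL AS INT) + 1").show()  // NULL: an operand is NULL
  spark.sql(s"SELECT ${Int.MaxValue} + 1").show()   // wraps today; NULL under null-on-overflow
}
```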


---




[GitHub] spark pull request #21537: [SPARK-24505][SQL] Convert strings in codegen to ...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21537#discussion_r197245629
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala ---
@@ -256,6 +283,22 @@ object EmptyBlock extends Block with Serializable {
   override def + (other: Block): Block = other
 }
 
+/**
+ * A block inlines all types of input arguments into a string without
+ * tracking any reference of `JavaCode` instances.
+ */
+case class InlineBlock(block: String) extends Block {
+  override val code: String = block
+  override val exprValues: Set[ExprValue] = Set.empty
+
+  override def + (other: Block): Block = other match {
+case c: CodeBlock => Blocks(Seq(this, c))
+case i: InlineBlock => InlineBlock(block + i.block)
+case b: Blocks => Blocks(Seq(this) ++ b.blocks)
--- End diff --

shall we do that PR first? I feel it's easier to review this PR after we 
clean up the `Block` framework.


---




[GitHub] spark pull request #21537: [SPARK-24505][SQL] Convert strings in codegen to ...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21537#discussion_r197245172
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
@@ -1004,26 +1012,29 @@ case class Cast(child: Expression, dataType: 
DataType, timeZoneId: Option[String
   private[this] def castToIntervalCode(from: DataType): CastFunction = 
from match {
 case StringType =>
   (c, evPrim, evNull) =>
-s"""$evPrim = CalendarInterval.fromString($c.toString());
+code"""$evPrim = CalendarInterval.fromString($c.toString());
if(${evPrim} == null) {
  ${evNull} = true;
}
  """.stripMargin
 
   }
 
-  private[this] def decimalToTimestampCode(d: String): String =
-s"($d.toBigDecimal().bigDecimal().multiply(new 
java.math.BigDecimal(100L))).longValue()"
-  private[this] def longToTimeStampCode(l: String): String = s"$l * 
100L"
-  private[this] def timestampToIntegerCode(ts: String): String =
-s"java.lang.Math.floor((double) $ts / 100L)"
-  private[this] def timestampToDoubleCode(ts: String): String = s"$ts / 
100.0"
+  private[this] def decimalToTimestampCode(d: ExprValue): Block = {
+val block = code"new java.math.BigDecimal(100L)"
+code"($d.toBigDecimal().bigDecimal().multiply($block)).longValue()"
+  }
+  private[this] def longToTimeStampCode(l: ExprValue): Block = code"$l * 
100L"
+  private[this] def timestampToIntegerCode(ts: ExprValue): Block =
+code"java.lang.Math.floor((double) $ts / 100L)"
+  private[this] def timestampToDoubleCode(ts: ExprValue): Block =
+code"$ts / 100.0"
 
   private[this] def castToBooleanCode(from: DataType): CastFunction = from 
match {
 case StringType =>
-  val stringUtils = StringUtils.getClass.getName.stripSuffix("$")
+  val stringUtils = 
inline"${StringUtils.getClass.getName.stripSuffix("$")}"
--- End diff --

what's the difference between inline and code?


---




[GitHub] spark pull request #21602: [SPARK-24613][SQL] Cache with UDF could not be ma...

2018-06-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21602


---




[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...

2018-06-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21602
  
Thanks! Merged to master/2.3


---




[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21602
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92174/
Test PASSed.


---




[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21602
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21565: [SPARK-24558][Core]wrong Idle Timeout value is used in c...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21565
  
**[Test build #92179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92179/testReport)** for PR 21565 at commit [`a5708cc`](https://github.com/apache/spark/commit/a5708cc59786edeae792e926821d8c26b01ee39d).


---




[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21602
  
**[Test build #92174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92174/testReport)** for PR 21602 at commit [`377f213`](https://github.com/apache/spark/commit/377f2134e6b4990b4c1d080e5fd5119fe808e057).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21565: [SPARK-24558][Core]wrong Idle Timeout value is used in c...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21565
  
**[Test build #92178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92178/testReport)** for PR 21565 at commit [`a5708cc`](https://github.com/apache/spark/commit/a5708cc59786edeae792e926821d8c26b01ee39d).


---




[GitHub] spark issue #21565: [SPARK-24558][Core]wrong Idle Timeout value is used in c...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21565
  
ok to test


---




[GitHub] spark pull request #21565: [SPARK-24558][Core]wrong Idle Timeout value is us...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21565#discussion_r197237620
  
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -488,9 +488,16 @@ private[spark] class ExecutorAllocationManager(
 newExecutorTotal = numExistingExecutors
 if (testing || executorsRemoved.nonEmpty) {
   executorsRemoved.foreach { removedExecutorId =>
+// If it is cachedBlcok timeout is configured using
+// spark.dynamicAllocation.cachedExecutorIdleTimeout
--- End diff --

can you refine the wording?


---




[GitHub] spark pull request #21599: [SPARK-24598][SQL] Overflow on arithmetic operati...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21599#discussion_r197235683
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -128,17 +128,31 @@ abstract class BinaryArithmetic extends 
BinaryOperator with NullIntolerant {
   def calendarIntervalMethod: String =
 sys.error("BinaryArithmetics must override either 
calendarIntervalMethod or genCode")
 
+  def checkOverflowCode(result: String, op1: String, op2: String): String =
+sys.error("BinaryArithmetics must override either checkOverflowCode or 
genCode")
+
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = 
dataType match {
 case _: DecimalType =>
   defineCodeGen(ctx, ev, (eval1, eval2) => 
s"$eval1.$decimalMethod($eval2)")
 case CalendarIntervalType =>
   defineCodeGen(ctx, ev, (eval1, eval2) => 
s"$eval1.$calendarIntervalMethod($eval2)")
+// In the following cases, overflow can happen, so we need to check 
the result is valid.
+// Otherwise we throw an ArithmeticException
--- End diff --

In current Spark we are very conservative about runtime errors, as they may break a data pipeline midway, and returning null is a commonly used strategy. Shall we follow it here? We can throw an exception once we have a strict mode.
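
A sketch of the two strategies being weighed (a hypothetical helper, not the PR's codegen): `Math.addExact` throws `ArithmeticException` on overflow; a lenient mode would catch it and return null, while a strict mode would rethrow.

```scala
// Hypothetical overflow policy: strict mode propagates the exception,
// lenient mode maps it to null, mirroring the trade-off discussed above.
def addWithPolicy(a: Int, b: Int, strictMode: Boolean): Any =
  try Math.addExact(a, b)
  catch {
    case e: ArithmeticException => if (strictMode) throw e else null
  }
```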


---




[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function

2018-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21061#discussion_r197235630
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2355,3 +2355,347 @@ case class ArrayRemove(left: Expression, right: 
Expression)
 
   override def prettyName: String = "array_remove"
 }
+
+object ArraySetLike {
+  def useGenericArrayData(elementSize: Int, length: Int): Boolean = {
+// Use the same calculation in UnsafeArrayData.fromPrimitiveArray()
+val headerInBytes = 
UnsafeArrayData.calculateHeaderPortionInBytes(length)
+val valueRegionInBytes = elementSize.toLong * length
+val totalSizeInLongs = (headerInBytes + valueRegionInBytes + 7) / 8
+totalSizeInLongs > Integer.MAX_VALUE / 8
+  }
+
+  def throwUnionLengthOverflowException(length: Int): Unit = {
+throw new RuntimeException(s"Unsuccessful try to union arrays with 
$length " +
+  s"elements due to exceeding the array size limit " +
+  s"${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
+  }
+}
+
+
+abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast {
+  override def dataType: DataType = left.dataType
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+val typeCheckResult = super.checkInputDataTypes()
+if (typeCheckResult.isSuccess) {
+  
TypeUtils.checkForOrderingExpr(dataType.asInstanceOf[ArrayType].elementType,
+s"function $prettyName")
+} else {
+  typeCheckResult
+}
+  }
+
+  @transient protected lazy val ordering: Ordering[Any] =
+TypeUtils.getInterpretedOrdering(elementType)
+
+  @transient protected lazy val elementTypeSupportEquals = elementType 
match {
+case BinaryType => false
+case _: AtomicType => true
+case _ => false
+  }
+}
+
+/**
+ * Returns an array of the elements in the union of x and y, without 
duplicates
+ */
+@ExpressionDescription(
+  usage = """
+_FUNC_(array1, array2) - Returns an array of the elements in the union 
of array1 and array2,
+  without duplicates.
+  """,
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(1, 2, 3), array(1, 3, 5));
+   array(1, 2, 3, 5)
+  """,
+  since = "2.4.0")
+case class ArrayUnion(left: Expression, right: Expression) extends 
ArraySetLike {
+  var hsInt: OpenHashSet[Int] = _
+  var hsLong: OpenHashSet[Long] = _
+
+  def assignInt(array: ArrayData, idx: Int, resultArray: ArrayData, pos: 
Int): Boolean = {
+val elem = array.getInt(idx)
+if (!hsInt.contains(elem)) {
+  resultArray.setInt(pos, elem)
+  hsInt.add(elem)
+  true
+} else {
+  false
+}
+  }
+
+  def assignLong(array: ArrayData, idx: Int, resultArray: ArrayData, pos: 
Int): Boolean = {
+val elem = array.getLong(idx)
+if (!hsLong.contains(elem)) {
+  resultArray.setLong(pos, elem)
+  hsLong.add(elem)
+  true
+} else {
+  false
+}
+  }
+
+  def evalPrimitiveType(
+  array1: ArrayData,
+  array2: ArrayData,
+  size: Int,
+  resultArray: ArrayData,
+  isLongType: Boolean): ArrayData = {
+// store elements into resultArray
+var foundNullElement = false
+var pos = 0
+Seq(array1, array2).foreach(array => {
+  var i = 0
+  while (i < array.numElements()) {
+if (array.isNullAt(i)) {
+  if (!foundNullElement) {
+resultArray.setNullAt(pos)
+pos += 1
+foundNullElement = true
+  }
+} else {
+  val assigned = if (!isLongType) {
+assignInt(array, i, resultArray, pos)
+  } else {
+assignLong(array, i, resultArray, pos)
+  }
+  if (assigned) {
+pos += 1
+  }
+}
+i += 1
+  }
+})
+resultArray
+  }
+
+  override def nullSafeEval(input1: Any, input2: Any): Any = {
+val array1 = input1.asInstanceOf[ArrayData]
+val array2 = input2.asInstanceOf[ArrayData]
+
+if (elementTypeSupportEquals) {
+  elementType match {
+case IntegerType =>
+  // avoid boxing of primitive int array elements
+  // calculate result array size
+  val hsSize = new OpenHashSet[Int]
+  Seq(array1, array2).foreach(array => {
+var i = 0
+while (i < array.numElements()) {
+  if (hsSize.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) 
{
+ArraySetLi

[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

2018-06-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21577


---




[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21577
  
> I pushed the change for that in: vanzin/spark@e6a862e

I like it; it's simpler to use the task id to replace the stage attempt id and task attempt id. For safety we should do it in master only, after this PR is merged.


---




[GitHub] spark pull request #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-21 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21482#discussion_r197234189
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -468,6 +468,18 @@ def input_file_name():
 return Column(sc._jvm.functions.input_file_name())
 
 
+@since(2.4)
+def isinf(col):
--- End diff --

@HyukjinKwon could you clarify, please?


---




[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1

2018-06-21 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/21598
  
> so we can't just change the default value in a feature release

Agreed. Once a particular interface and behavior is in our released public API, we effectively have a contract not to change that behavior. If we are going to provide another behavior before a new major-number release (e.g. spark-3.0.0), then we have to provide a user configuration option to select the new behavior, and the default behavior, when a user doesn't change the configuration, must be the same as before the optional new behavior was introduced.

If there is a clear, severe bug (such as data loss or corruption), only then can we consider changing the public API before a new major-number release -- but even then we are likely either to go immediately to a new major number or to at least preserve the old, buggy behavior behind a configuration option.


---




[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21587
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/382/
Test PASSed.


---




[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21587
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21598
  
@rxin yes we have; I think they are all listed in the [2.4 migration guide](https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#upgrading-from-spark-sql-23-to-24).

I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it.


---




[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21587
  
**[Test build #92177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92177/testReport)** for PR 21587 at commit [`72466b0`](https://github.com/apache/spark/commit/72466b0026469379ff1ccce350b538f66c0b384a).


---




[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21587
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21587
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/381/
Test FAILed.


---




[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function

2018-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21061#discussion_r197229189
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---

[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1

2018-06-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21598
  
Do we have other "legacy" configs that we haven't released and can change 
to match this prefix? It's pretty nice to have a single prefix for stuff like 
this.



---




[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function

2018-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21061#discussion_r197229074
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---

[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

2018-06-21 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21577
  
I will 


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-21 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
If you are working on this, I'll merge the other one, wait for you, and continue to investigate in parallel.


---




[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function

2018-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21061#discussion_r197228219
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---

[GitHub] spark pull request #21598: [SPARK-24605][SQL] size(null) returns null instea...

2018-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21598#discussion_r197227370
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1314,6 +1314,13 @@ object SQLConf {
   "Other column values can be ignored during parsing even if they are 
malformed.")
 .booleanConf
 .createWithDefault(true)
+
+  val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
--- End diff --

I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it.

It's similar to https://github.com/apache/spark/pull/21427#issuecomment-396142545, but as I replied in that PR, having a version-specific config is overkill, while `legacy` is simpler and makes it more explicit that it will be removed in the future.
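
A simplified sketch (not Spark's implementation) of how the quoted `spark.sql.legacy.sizeOfNull` flag gates behavior: `true` keeps the old `size(null) == -1` result, `false` switches to returning null.

```scala
// Hypothetical, simplified model of the legacy-flag gating described above.
def sizeOf(arr: Seq[_], legacySizeOfNull: Boolean): Any =
  if (arr == null) { if (legacySizeOfNull) -1 else null } else arr.size
```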


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-21 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
Ah ok, I was looking at my own version as well. There are other things we should update for v2 too: other functions with the variable names, the description in DataWriterFactory.java, etc.


---




[GitHub] spark pull request #21558: [SPARK-24552][SQL] Use task ID instead of attempt...

2018-06-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/21558#discussion_r197222759
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala ---
@@ -110,7 +108,7 @@ object DataWritingSparkTask extends Logging {
   useCommitCoordinator: Boolean): WriterCommitMessage = {
 val stageId = context.stageId()
 val partId = context.partitionId()
-val attemptId = context.attemptNumber()
+val attemptId = context.taskAttemptId().toInt
--- End diff --

HadoopWriteConfigUtil has the same issue: it's a public interface and uses an int for the attempt number.
It seems somewhat unlikely, but it is more likely that task IDs overflow an int in Spark than in, say, MapReduce. We do have partitionId as an Int, so if the partition count approaches Int.MaxValue and you have task failures, task IDs could go over Int. Looking at our options.
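
To make the narrowing hazard concrete, a small self-contained example (the values are hypothetical):

```scala
// Long.toInt keeps only the low 32 bits, so once task IDs pass Int.MaxValue
// the narrowed value wraps around and distinct tasks can collide.
val taskAttemptId: Long = Int.MaxValue.toLong + 1L
val narrowed: Int = taskAttemptId.toInt
println(narrowed) // prints -2147483648
```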




---




[GitHub] spark issue #21599: [SPARK-24598][SQL] Overflow on arithmetic operations ret...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21599
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92172/
Test FAILed.


---




[GitHub] spark issue #21599: [SPARK-24598][SQL] Overflow on arithmetic operations ret...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21599
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21599: [SPARK-24598][SQL] Overflow on arithmetic operations ret...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21599
  
**[Test build #92172 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92172/testReport)**
 for PR 21599 at commit 
[`9c3df7d`](https://github.com/apache/spark/commit/9c3df7d553f523581fe79a64a1bb167062728103).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
FYI, I'm preparing my own version of this PR with the remaining feedback 
addressed. Ryan was on paternity leave and I don't know whether he's done yet, 
so he may not be that responsive.

This will conflict with the output commit coordinator change in any case, 
so one of them needs to wait (and that one is further along).


---




[GitHub] spark issue #21527: [SPARK-24519] Make the threshold for highly compressed m...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21527
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21527: [SPARK-24519] Make the threshold for highly compressed m...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21527
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92169/
Test PASSed.


---




[GitHub] spark issue #21527: [SPARK-24519] Make the threshold for highly compressed m...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21527
  
**[Test build #92169 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92169/testReport)**
 for PR 21527 at commit 
[`4cb492e`](https://github.com/apache/spark/commit/4cb492ed483c73acdd8f77aef524582ce4dcedff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

2018-06-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21577
  
So, does anyone want to do the actual merging?


---




[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1

2018-06-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21598
  
This is not a "bug" and there is no "right" behavior in APIs. It's been 
defined as -1 since the very beginning (when was it added?), so we can't just 
change the default value in a feature release.


---




[GitHub] spark issue #21565: [SPARK-24558][Core]wrong Idle Timeout value is used in c...

2018-06-21 Thread sandeep-katta
Github user sandeep-katta commented on the issue:

https://github.com/apache/spark/pull/21565
  
@cloud-fan can you please review this small piece of code and merge this PR?


---




[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20944
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20944
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92171/
Test PASSed.


---




[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20944
  
**[Test build #92171 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92171/testReport)**
 for PR 20944 at commit 
[`7133d7a`](https://github.com/apache/spark/commit/7133d7aac3c1aaffbd8a9c4433060c4a9d488035).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-06-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21588
  
Let me phrase the question a different way: your title says "Make jenkins 
tests passed" [sic]. If you check this in, and we enable a jenkins job for 
hadoop 3, will it pass?

I'm 100% sure the answer is no.

So fix the Hive fork, then update this PR, and let's have it actually run 
through jenkins.


---




[GitHub] spark issue #21548: [SPARK-24518][CORE] Using Hadoop credential provider API...

2018-06-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21548
  
Could you update the summary so that it doesn't sound like this is an 
existing security issue?


---




[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20944
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20944
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92170/
Test PASSed.


---




[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20944
  
**[Test build #92170 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92170/testReport)**
 for PR 20944 at commit 
[`da36564`](https://github.com/apache/spark/commit/da3656490c7863b74f6274d463b07abcd015ade5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version

2018-06-21 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/21596
  
Please make sure that performance doesn't degrade after upgrading Jackson. You can check that with 
[JsonBenchmarks](https://github.com/apache/spark/blob/bd14da6fd5a77cc03efff193a84ffccbe892cc13/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmarks.scala).


---




[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...

2018-06-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21603#discussion_r197191390
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
 ---
@@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
   case sources.Not(pred) =>
 createFilter(schema, pred).map(FilterApi.not)
 
+  case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 =>
--- End diff --

Let's make the threshold configurable, e.g. `spark.sql.parquet.pushdown.inFilterThreshold`. By default, it should be around 10. Please also check the perf.

cc @jiangxb1987 
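
For illustration, a hedged sketch of the pushdown shape being discussed, written in the style of the `createFilter` match quoted above — the threshold wiring is an assumption, not the PR's final code:

```scala
// Convert a source In filter into OR-chained Parquet equality predicates,
// but only when the value list stays below a (to-be-configured) threshold.
case sources.In(name, values) if canMakeFilterOn(name) &&
    values.length < inFilterThreshold => // hypothetical config-backed value
  values.flatMap { v =>
    createFilter(schema, sources.EqualTo(name, v))
  }.reduceOption(FilterApi.or)
```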


---




[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92168/
Test FAILed.


---




[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #92168 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92168/testReport)**
 for PR 16677 at commit 
[`5594bf9`](https://github.com/apache/spark/commit/5594bf9f13aa83d05a433bad0fd366daabd2d034).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21605
  
**[Test build #92176 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92176/testReport)**
 for PR 21605 at commit 
[`fb8cd54`](https://github.com/apache/spark/commit/fb8cd5466c39b69822db145d17a2e40255e74b03).


---




[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21605
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/380/
Test PASSed.


---




[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21605
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21605
  
cc @cloud-fan @daniel-shields @WenboZhao 


---




[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21449
  
OK, so I created https://github.com/apache/spark/pull/21605 for the fix proposed by @daniel-shields. I'd like to leave this one open so we can continue the discussion toward a better long-term fix.


---




[GitHub] spark pull request #21605: [SPARK-24385][SQL] Resolve self-join condition am...

2018-06-21 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/21605

[SPARK-24385][SQL] Resolve self-join condition ambiguity for EqualNullSafe

## What changes were proposed in this pull request?

In Dataset.join we have a small hack for resolving ambiguity in the column 
name for self-joins. The current code supports only `EqualTo`.

The PR extends the fix to `EqualNullSafe`.

Credit for this PR should be given to @daniel-shields.

## How was this patch tested?

Added a unit test.
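
For reference, a minimal sketch of the kind of rewrite involved — an assumption based on the description above, not a quote of the patch; `resolveAgainst` is a hypothetical helper that resolves a column name against one join side:

```scala
// Assumes org.apache.spark.sql.catalyst.expressions._ is in scope.
// When both sides of an equality predicate reference the same attribute in a
// self-join, re-resolve each side against the corresponding join child. The
// existing hack did this for EqualTo only; this change extends it to <=>.
condition.transform {
  case EqualTo(a: AttributeReference, b: AttributeReference) if a.sameRef(b) =>
    EqualTo(resolveAgainst(left, a.name), resolveAgainst(right, b.name))
  case EqualNullSafe(a: AttributeReference, b: AttributeReference) if a.sameRef(b) =>
    EqualNullSafe(resolveAgainst(left, a.name), resolveAgainst(right, b.name))
}
```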


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-24385_2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21605






---




[GitHub] spark issue #21282: [SPARK-23934][SQL] Adding map_from_entries function

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21282
  
**[Test build #92175 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92175/testReport)**
 for PR 21282 at commit 
[`4eaedc5`](https://github.com/apache/spark/commit/4eaedc50f92a3dd1ee2100fbbd5ac951344ece75).


---




[GitHub] spark pull request #21548: [SPARK-24518][CORE] Using Hadoop credential provi...

2018-06-21 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21548#discussion_r197182301
  
--- Diff: core/src/main/scala/org/apache/spark/SSLOptions.scala ---
@@ -179,9 +185,11 @@ private[spark] object SSLOptions extends Logging {
 .orElse(defaults.flatMap(_.keyStore))
 
 val keyStorePassword = conf.getWithSubstitution(s"$ns.keyStorePassword")
+  .orElse(Option(hadoopConf.getPassword(s"$ns.keyStorePassword")).map(new String(_)))
--- End diff --

`new String` takes a charset. (In fact the constructor you're calling 
should be deprecated...)
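
For illustration, a hedged sketch of one explicit alternative, since `getPassword` returns a nullable `char[]`: `String.valueOf` sidesteps the constructor-overload question entirely (not the PR's code):

```scala
// Convert the char[] from Hadoop's credential API to a String without
// going through an ambiguous String constructor.
val keyStorePassword: Option[String] =
  Option(hadoopConf.getPassword(s"$ns.keyStorePassword"))
    .map(chars => String.valueOf(chars))
```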


---




[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread WenboZhao
Github user WenboZhao commented on the issue:

https://github.com/apache/spark/pull/21449
  
I like the proposal by @daniel-shields. If we can get it fixed soon, we will be able to make the Spark 2.3.2 release.


---




[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

2018-06-21 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/21257
  
Some overall thoughts:

* I think this only happens on a successful job commit, not on abort. Is that the desired behavior?
* If something goes wrong here, is failing the entire job the correct action? If the deletes were happening earlier, then yes, the job would obviously fail. But now the core work has taken place and it's just the cleanup failing, which could be due to permissions, a transient network issue, etc. (see the sketch below)

I'll have to look a bit closer at what happens in committer cleanups right now, though since they are focused on `rm -f $dest/__temporary/$jobAttempt`, they are less worried about failures there, as it shouldn't be changing any public datasets.
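
A hedged sketch of the alternative being hinted at — treat post-commit cleanup failure as a warning, since the real work has already been committed; the names (`pathsToDelete`, `fs`, the `Logging` mixin's `logWarning`) are illustrative, not the PR's code:

```scala
import java.io.IOException

// Log and continue if cleanup after a successful job commit fails for an
// incidental reason (permissions, transient network, ...), instead of
// failing a job whose output has already landed.
try {
  pathsToDelete.foreach(path => fs.delete(path, true))
} catch {
  case e: IOException =>
    logWarning("Cleanup after committed job failed; leaving files behind", e)
}
```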


---




[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...

2018-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21602
  
**[Test build #92174 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92174/testReport)**
 for PR 21602 at commit 
[`377f213`](https://github.com/apache/spark/commit/377f2134e6b4990b4c1d080e5fd5119fe808e057).


---




[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21602
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...

2018-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21602
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/379/
Test PASSed.


---




[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-06-21 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r197177156
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -235,4 +244,41 @@ class HadoopMapReduceCommitProtocol(
   tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
 }
   }
+
+  /**
+   * For now, just record the file to be deleted.
+   */
+  override def deleteWithJob(fs: FileSystem, path: Path,
--- End diff --

No need to worry about concurrent access here, correct? 
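
(For what it's worth, a hedged sketch of what thread-safe recording would look like if concurrent access did matter — illustrative names, not the PR's code:)

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import org.apache.hadoop.fs.Path

// A concurrent queue makes the record-now-delete-at-commit pattern safe
// even if deleteWithJob were invoked from multiple threads.
private val pathsToDelete = new ConcurrentLinkedQueue[Path]()

def recordForDeletion(path: Path): Boolean = {
  pathsToDelete.add(path) // the actual delete is deferred to commitJob()
}
```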


---




[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...

2018-06-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21602
  
retest this please


---




[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-06-21 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r197176461
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
 }
 // first clear the path determined by the static partition keys (e.g. /table/foo=1)
 val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
-  throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+if (fs.exists(staticPrefixPath)) {
+  if (staticPartitionPrefix.isEmpty && outputCheck) {
+// input contains output, so only delete output sub-files at job commit
+  val files = fs.listFiles(staticPrefixPath, false)
+  while (files.hasNext) {
+val file = files.next()
+if (!committer.deleteWithJob(fs, file.getPath, false)) {
+  throw new IOException(s"Unable to clear output " +
+s"directory ${file.getPath} prior to writing to it")
+}
+  }
+  } else {
+if (!committer.deleteWithJob(fs, staticPrefixPath, true)) {
+  throw new IOException(s"Unable to clear output " +
--- End diff --

again, hard to see how this exception path would be reached.


---




[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-06-21 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r197176292
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
 }
 // first clear the path determined by the static partition keys (e.g. /table/foo=1)
 val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
-  throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+if (fs.exists(staticPrefixPath)) {
+  if (staticPartitionPrefix.isEmpty && outputCheck) {
+// input contains output, so only delete output sub-files at job commit
+  val files = fs.listFiles(staticPrefixPath, false)
--- End diff --

If there are a lot of files here, you've gone from a directory delete, which was O(1) on a filesystem and probably O(descendants) on an object store, to an O(children) operation on a filesystem and an O(children * descendants(child)) operation on an object store. Not significant for a small number of files, but potentially expensive. Why do the iteration at all?


---




[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-06-21 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r197174835
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
 }
 // first clear the path determined by the static partition keys (e.g. /table/foo=1)
 val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
-  throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+if (fs.exists(staticPrefixPath)) {
+  if (staticPartitionPrefix.isEmpty && outputCheck) {
+// input contains output, so only delete output sub-files at job commit
+  val files = fs.listFiles(staticPrefixPath, false)
+  while (files.hasNext) {
+val file = files.next()
+if (!committer.deleteWithJob(fs, file.getPath, false)) {
+  throw new IOException(s"Unable to clear output " +
--- End diff --

Since `committer.deleteWithJob()` returns true in the base class, that check won't do much, at least not with the default impl. It would probably be better to have `deleteWithJob()` return Unit and require implementations to raise an exception on a delete failure. Given that `delete()` is required to guarantee that the dest doesn't exist once it returns, I don't think callers need to do any checks at all. (A sketch of that contract is below.)


---



