[GitHub] spark issue #21583: [SPARK-23984][K8S][Test] Added Integration Tests for PyS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21583 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/385/
[GitHub] spark issue #21508: [SPARK-24488] [SQL] Fix issue when generator is aliased ...
Github user bkrieger commented on the issue: https://github.com/apache/spark/pull/21508 @gatorsmile @hvanhovell can you take another look at this?
[GitHub] spark issue #21583: [SPARK-23984][K8S][Test] Added Integration Tests for PyS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21583 **[Test build #92182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92182/testReport)** for PR 21583 at commit [`2707dee`](https://github.com/apache/spark/commit/2707dee967fd6c4cebbe96cc7ae40feb5bfced24).
[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user henryr commented on the issue: https://github.com/apache/spark/pull/21482 Any further comments here?
[GitHub] spark issue #21583: [SPARK-23984][K8S][Test] Added Integration Tests for PyS...
Github user ifilonenko commented on the issue: https://github.com/apache/spark/pull/21583 retest this please
[GitHub] spark pull request #21606: [SPARK-24552][core][SQL] Use task ID instead of a...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21606#discussion_r197263827

--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -104,12 +104,12 @@ object SparkHadoopWriter extends Logging {
       jobTrackerId: String,
       commitJobId: Int,
       sparkPartitionId: Int,
-      sparkAttemptNumber: Int,
+      sparkTaskId: Long,
       committer: FileCommitProtocol,
       iterator: Iterator[(K, V)]): TaskCommitMessage = {
     // Set up a task.
     val taskContext = config.createTaskAttemptContext(
-      jobTrackerId, commitJobId, sparkPartitionId, sparkAttemptNumber)
+      jobTrackerId, commitJobId, sparkPartitionId, sparkTaskId.toInt)
--- End diff --

is it safe?
[GitHub] spark pull request #21606: [SPARK-24552][core][SQL] Use task ID instead of a...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21606#discussion_r197265481

--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -104,12 +104,12 @@ object SparkHadoopWriter extends Logging {
       jobTrackerId: String,
       commitJobId: Int,
       sparkPartitionId: Int,
-      sparkAttemptNumber: Int,
+      sparkTaskId: Long,
       committer: FileCommitProtocol,
       iterator: Iterator[(K, V)]): TaskCommitMessage = {
     // Set up a task.
     val taskContext = config.createTaskAttemptContext(
-      jobTrackerId, commitJobId, sparkPartitionId, sparkAttemptNumber)
+      jobTrackerId, commitJobId, sparkPartitionId, sparkTaskId.toInt)
--- End diff --

task id is unique across the entire Spark application, which means we may have a very large task id in a long-running micro-batch streaming application. If we do need an int here, I'd suggest we combine `stageAttemptNumber` and `taskAttemptNumber` into an int, which is much less risky. (Spark won't have a lot of stage/task attempts)
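A minimal sketch of the suggested packing, under the assumption that both counters stay below 2^16; the helper name and bit split are illustrative, not code from the PR:

```scala
// Hypothetical: pack two small attempt counters into one Int.
// Unlike the application-wide task ID, these never grow unboundedly.
def packAttemptId(stageAttemptNumber: Int, taskAttemptNumber: Int): Int = {
  require(stageAttemptNumber < (1 << 16) && taskAttemptNumber < (1 << 16),
    "attempt numbers are expected to stay small")
  (stageAttemptNumber << 16) | taskAttemptNumber
}
```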
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Merged build finished. Test PASSed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21606 **[Test build #92181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92181/testReport)** for PR 21606 at commit [`7233a5f`](https://github.com/apache/spark/commit/7233a5fd7b154e2a1400c5fac11d0356a22f5f98).
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/384/ Test PASSed.
[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 See link above for the updated PR.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21606 Credit here should go to @rdblue when merging.
[GitHub] spark pull request #21606: [SPARK-24552][core][SQL] Use task ID instead of a...
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/21606

[SPARK-24552][core][SQL] Use task ID instead of attempt number for writes.

This passes the unique task attempt id instead of attempt number to v2 data sources because attempt number is reused when stages are retried. When attempt numbers are reused, sources that track data by partition id and attempt number may incorrectly clean up data because the same attempt number can be both committed and aborted.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-24552.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21606.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21606

commit 6c60d1462c34f01610ada50c989832775b6fd117
Author: Ryan Blue
Date: 2018-06-13T19:50:00Z

    SPARK-24552: Use task ID instead of attempt number for v2 writes.

commit 2e6552460eed3013e649b06b16a1d14b1e542e2d
Author: Marcelo Vanzin
Date: 2018-06-21T17:21:00Z

    Rename attemptId -> taskId for clarity.

commit 3561723341c3062ba7d8682ea272c549b4bdc245
Author: Marcelo Vanzin
Date: 2018-06-21T17:28:12Z

    Use task ID instead of attempt for the Hadoop API too.

commit d5a079d439740f3067722d4e8c9e8e94f292017c
Author: Marcelo Vanzin
Date: 2018-06-21T18:37:54Z

    Merge branch 'master' into SPARK-24552.2

commit fdcd39c852e9a2d70da95c37da04190910e7b2f0
Author: Marcelo Vanzin
Date: 2018-06-21T18:51:48Z

    Log message update.

commit 7233a5fd7b154e2a1400c5fac11d0356a22f5f98
Author: Marcelo Vanzin
Date: 2018-06-21T18:57:02Z

    Javadoc updates.
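To make the failure mode concrete, here is a toy sketch (not code from the PR) of why keying committed output on (partitionId, attemptNumber) is unsafe across stage retries, while the application-unique task ID is not:

```scala
import scala.collection.mutable

// A sink that tracks task output by (partitionId, attemptNumber).
val byAttempt = mutable.Map[(Int, Int), String]()
byAttempt((3, 0)) = "aborted"    // stage attempt 0: the task for partition 3 is aborted
byAttempt((3, 0)) = "committed"  // stage retry: attempt number 0 is reused, so the
                                 // aborted and committed records collide on one key

// Keyed by task ID, every task attempt in the application gets a distinct key.
val byTaskId = mutable.Map[Long, String](1234L -> "aborted", 5678L -> "committed")
```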
[GitHub] spark issue #21282: [SPARK-23934][SQL] Adding map_from_entries function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21282 Merged build finished. Test PASSed.
[GitHub] spark issue #21282: [SPARK-23934][SQL] Adding map_from_entries function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21282 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92175/ Test PASSed.
[GitHub] spark issue #21282: [SPARK-23934][SQL] Adding map_from_entries function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21282

**[Test build #92175 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92175/testReport)** for PR 21282 at commit [`4eaedc5`](https://github.com/apache/spark/commit/4eaedc50f92a3dd1ee2100fbbd5ac951344ece75).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21542 **[Test build #92180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92180/testReport)** for PR 21542 at commit [`86ef42c`](https://github.com/apache/spark/commit/86ef42ceaa3bdc623fa0b01f3ea076cf0f63902a).
[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21542 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/383/ Test PASSed.
[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21542 Merged build finished. Test PASSed.
[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21605 Merged build finished. Test PASSed.
[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21605 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92176/ Test PASSed.
[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21605

**[Test build #92176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92176/testReport)** for PR 21605 at commit [`fb8cd54`](https://github.com/apache/spark/commit/fb8cd5466c39b69822db145d17a2e40255e74b03).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197247925

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -105,24 +105,58 @@ class CacheManager extends Logging {
   }

   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the given plan.
+   * @param query    The [[Dataset]] to be un-cached.
+   * @param cascade  If true, un-cache all the cache entries that refer to the given
+   *                 [[Dataset]]; otherwise un-cache the given [[Dataset]] only.
+   * @param blocking Whether to block until all blocks are deleted.
    */
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-    uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+    cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
+    uncacheQuery(query.sparkSession, query.logicalPlan, cascade, blocking)
   }

   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the given plan.
+   * @param spark    The Spark session.
+   * @param plan     The plan to be un-cached.
+   * @param cascade  If true, un-cache all the cache entries that refer to the given
+   *                 plan; otherwise un-cache the given plan only.
+   * @param blocking Whether to block until all blocks are deleted.
    */
-  def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+  def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+    cascade: Boolean, blocking: Boolean): Unit = writeLock {
+    val condition: LogicalPlan => Boolean =
--- End diff --

`condition` -> `shouldUnCache`?
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197247763

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -105,24 +105,58 @@ class CacheManager extends Logging {
   }

   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the given plan.
+   * @param query    The [[Dataset]] to be un-cached.
+   * @param cascade  If true, un-cache all the cache entries that refer to the given
+   *                 [[Dataset]]; otherwise un-cache the given [[Dataset]] only.
+   * @param blocking Whether to block until all blocks are deleted.
    */
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-    uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+    cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
+    uncacheQuery(query.sparkSession, query.logicalPlan, cascade, blocking)
   }

   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the given plan.
+   * @param spark    The Spark session.
+   * @param plan     The plan to be un-cached.
+   * @param cascade  If true, un-cache all the cache entries that refer to the given
+   *                 plan; otherwise un-cache the given plan only.
+   * @param blocking Whether to block until all blocks are deleted.
    */
-  def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+  def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+    cascade: Boolean, blocking: Boolean): Unit = writeLock {
--- End diff --

ditto
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197247440

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2971,7 +2971,7 @@ class Dataset[T] private[sql](
    * @since 1.6.0
    */
   def unpersist(blocking: Boolean): this.type = {
-    sparkSession.sharedState.cacheManager.uncacheQuery(this, blocking)
+    sparkSession.sharedState.cacheManager.uncacheQuery(this, false, blocking)
--- End diff --

nit: it's clearer to write `cascade = false`
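That is, the call site reads better with named arguments; a sketch of the suggested style, assuming the signature from the CacheManager diff above:

```scala
sparkSession.sharedState.cacheManager.uncacheQuery(this, cascade = false, blocking = blocking)
```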
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197247632

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -105,24 +105,58 @@ class CacheManager extends Logging {
   }

   /**
-   * Un-cache all the cache entries that refer to the given plan.
+   * Un-cache the given plan or all the cache entries that refer to the given plan.
+   * @param query    The [[Dataset]] to be un-cached.
+   * @param cascade  If true, un-cache all the cache entries that refer to the given
+   *                 [[Dataset]]; otherwise un-cache the given [[Dataset]] only.
+   * @param blocking Whether to block until all blocks are deleted.
    */
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-    uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+    cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
--- End diff --

nit:

```
def f(
    param1: X,
    param2: Y)
```

4-space indentation.
[GitHub] spark pull request #21599: [SPARK-24598][SQL] Overflow on arithmetic operati...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21599#discussion_r197246916

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -128,17 +128,31 @@ abstract class BinaryArithmetic extends BinaryOperator with NullIntolerant {
   def calendarIntervalMethod: String =
     sys.error("BinaryArithmetics must override either calendarIntervalMethod or genCode")

+  def checkOverflowCode(result: String, op1: String, op2: String): String =
+    sys.error("BinaryArithmetics must override either checkOverflowCode or genCode")
+
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = dataType match {
     case _: DecimalType =>
       defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.$decimalMethod($eval2)")
     case CalendarIntervalType =>
       defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.$calendarIntervalMethod($eval2)")
+    // In the following cases, overflow can happen, so we need to check the result is valid.
+    // Otherwise we throw an ArithmeticException
--- End diff --

Personally, I am quite against returning null. It is not something a user expects, so he/she is likely not to check for it (when I see a NULL myself, I think that one of the 2 operands was NULL, not that an overflow occurred), so he/she won't realize the issue and will find corrupted data. Moreover, this is not how RDBMSs behave, and it is against the SQL standard. So I think that the behavior which was chosen for DECIMAL was wrong, and I'd prefer not to introduce the same behavior in other places. Anyway, I see your point about consistency over the codebase and it makes sense. I'd love to know @gatorsmile and @hvanhovell's opinions too.
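For reference, the two semantics under debate for an integer add, as a hedged sketch; the helper names are illustrative, not the PR's code:

```scala
// Strict: fail loudly on overflow, as most RDBMSs and the SQL standard do.
def addStrict(a: Int, b: Int): Int = Math.addExact(a, b) // throws ArithmeticException

// Permissive: return null on overflow, the behavior chosen for DECIMAL.
def addPermissive(a: Int, b: Int): java.lang.Integer = {
  val sum = a.toLong + b.toLong
  if (sum > Int.MaxValue || sum < Int.MinValue) null else Int.box(sum.toInt)
}
```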
[GitHub] spark pull request #21537: [SPARK-24505][SQL] Convert strings in codegen to ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21537#discussion_r197245629

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala ---
@@ -256,6 +283,22 @@ object EmptyBlock extends Block with Serializable {
   override def + (other: Block): Block = other
 }

+/**
+ * A block inlines all types of input arguments into a string without
+ * tracking any reference of `JavaCode` instances.
+ */
+case class InlineBlock(block: String) extends Block {
+  override val code: String = block
+  override val exprValues: Set[ExprValue] = Set.empty
+
+  override def + (other: Block): Block = other match {
+    case c: CodeBlock => Blocks(Seq(this, c))
+    case i: InlineBlock => InlineBlock(block + i.block)
+    case b: Blocks => Blocks(Seq(this) ++ b.blocks)
--- End diff --

shall we do that PR first? I feel it's easier to review this PR after we clean up the `Block` framework.
[GitHub] spark pull request #21537: [SPARK-24505][SQL] Convert strings in codegen to ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21537#discussion_r197245172

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
@@ -1004,26 +1012,29 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String]
   private[this] def castToIntervalCode(from: DataType): CastFunction = from match {
     case StringType =>
       (c, evPrim, evNull) =>
-        s"""$evPrim = CalendarInterval.fromString($c.toString());
+        code"""$evPrim = CalendarInterval.fromString($c.toString());
           if(${evPrim} == null) {
             ${evNull} = true;
           }
         """.stripMargin
   }

-  private[this] def decimalToTimestampCode(d: String): String =
-    s"($d.toBigDecimal().bigDecimal().multiply(new java.math.BigDecimal(1000000L))).longValue()"
-  private[this] def longToTimeStampCode(l: String): String = s"$l * 1000000L"
-  private[this] def timestampToIntegerCode(ts: String): String =
-    s"java.lang.Math.floor((double) $ts / 1000000L)"
-  private[this] def timestampToDoubleCode(ts: String): String = s"$ts / 1000000.0"
+  private[this] def decimalToTimestampCode(d: ExprValue): Block = {
+    val block = code"new java.math.BigDecimal(1000000L)"
+    code"($d.toBigDecimal().bigDecimal().multiply($block)).longValue()"
+  }
+  private[this] def longToTimeStampCode(l: ExprValue): Block = code"$l * 1000000L"
+  private[this] def timestampToIntegerCode(ts: ExprValue): Block =
+    code"java.lang.Math.floor((double) $ts / 1000000L)"
+  private[this] def timestampToDoubleCode(ts: ExprValue): Block =
+    code"$ts / 1000000.0"

   private[this] def castToBooleanCode(from: DataType): CastFunction = from match {
     case StringType =>
-      val stringUtils = StringUtils.getClass.getName.stripSuffix("$")
+      val stringUtils = inline"${StringUtils.getClass.getName.stripSuffix("$")}"
--- End diff --

what's the difference between inline and code?
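Grounded in the doc comment quoted in the previous diff, a sketch of the distinction (`inline` is the interpolator this PR introduces; exact signatures may differ):

```scala
// `code"..."` builds a CodeBlock that records the ExprValue/JavaCode instances
// interpolated into it (exposed via `exprValues`), so later passes can track
// references. `inline"..."` builds an InlineBlock that splices plain text and,
// per the doc comment above, tracks no `JavaCode` references at all.
val v = JavaCode.variable("value", IntegerType)
val tracked = code"int x = $v;"      // tracked.exprValues contains `v`
val spliced = inline"int x = value;" // spliced.exprValues is Set.empty
```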
[GitHub] spark pull request #21602: [SPARK-24613][SQL] Cache with UDF could not be ma...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21602
[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21602 Thanks! Merged to master/2.3
[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21602 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92174/ Test PASSed.
[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21602 Merged build finished. Test PASSed.
[GitHub] spark issue #21565: [SPARK-24558][Core]wrong Idle Timeout value is used in c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21565 **[Test build #92179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92179/testReport)** for PR 21565 at commit [`a5708cc`](https://github.com/apache/spark/commit/a5708cc59786edeae792e926821d8c26b01ee39d).
[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21602

**[Test build #92174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92174/testReport)** for PR 21602 at commit [`377f213`](https://github.com/apache/spark/commit/377f2134e6b4990b4c1d080e5fd5119fe808e057).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21565: [SPARK-24558][Core]wrong Idle Timeout value is used in c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21565 **[Test build #92178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92178/testReport)** for PR 21565 at commit [`a5708cc`](https://github.com/apache/spark/commit/a5708cc59786edeae792e926821d8c26b01ee39d).
[GitHub] spark issue #21565: [SPARK-24558][Core]wrong Idle Timeout value is used in c...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21565 ok to test
[GitHub] spark pull request #21565: [SPARK-24558][Core]wrong Idle Timeout value is us...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21565#discussion_r197237620

--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -488,9 +488,16 @@ private[spark] class ExecutorAllocationManager(
       newExecutorTotal = numExistingExecutors
       if (testing || executorsRemoved.nonEmpty) {
         executorsRemoved.foreach { removedExecutorId =>
+          // If it is cachedBlcok timeout is configured using
+          // spark.dynamicAllocation.cachedExecutorIdleTimeout
--- End diff --

can you refine the wording?
[GitHub] spark pull request #21599: [SPARK-24598][SQL] Overflow on arithmetic operati...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21599#discussion_r197235683

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -128,17 +128,31 @@ abstract class BinaryArithmetic extends BinaryOperator with NullIntolerant {
   def calendarIntervalMethod: String =
     sys.error("BinaryArithmetics must override either calendarIntervalMethod or genCode")

+  def checkOverflowCode(result: String, op1: String, op2: String): String =
+    sys.error("BinaryArithmetics must override either checkOverflowCode or genCode")
+
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = dataType match {
     case _: DecimalType =>
       defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.$decimalMethod($eval2)")
     case CalendarIntervalType =>
       defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.$calendarIntervalMethod($eval2)")
+    // In the following cases, overflow can happen, so we need to check the result is valid.
+    // Otherwise we throw an ArithmeticException
--- End diff --

In current Spark we are very conservative about runtime errors, as they may break the data pipeline midway, and returning null is a commonly used strategy. Shall we follow it here? We can throw an exception when we have a strict mode.
[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21061#discussion_r197235630

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2355,3 +2355,347 @@ case class ArrayRemove(left: Expression, right: Expression)
   override def prettyName: String = "array_remove"
 }
+
+object ArraySetLike {
+  def useGenericArrayData(elementSize: Int, length: Int): Boolean = {
+    // Use the same calculation in UnsafeArrayData.fromPrimitiveArray()
+    val headerInBytes = UnsafeArrayData.calculateHeaderPortionInBytes(length)
+    val valueRegionInBytes = elementSize.toLong * length
+    val totalSizeInLongs = (headerInBytes + valueRegionInBytes + 7) / 8
+    totalSizeInLongs > Integer.MAX_VALUE / 8
+  }
+
+  def throwUnionLengthOverflowException(length: Int): Unit = {
+    throw new RuntimeException(s"Unsuccessful try to union arrays with $length " +
+      s"elements due to exceeding the array size limit " +
+      s"${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
+  }
+}
+
+
+abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast {
+  override def dataType: DataType = left.dataType
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val typeCheckResult = super.checkInputDataTypes()
+    if (typeCheckResult.isSuccess) {
+      TypeUtils.checkForOrderingExpr(dataType.asInstanceOf[ArrayType].elementType,
+        s"function $prettyName")
+    } else {
+      typeCheckResult
+    }
+  }
+
+  @transient protected lazy val ordering: Ordering[Any] =
+    TypeUtils.getInterpretedOrdering(elementType)
+
+  @transient protected lazy val elementTypeSupportEquals = elementType match {
+    case BinaryType => false
+    case _: AtomicType => true
+    case _ => false
+  }
+}
+
+/**
+ * Returns an array of the elements in the union of x and y, without duplicates
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(array1, array2) - Returns an array of the elements in the union of array1 and array2,
+      without duplicates.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(1, 2, 3), array(1, 3, 5));
+       array(1, 2, 3, 5)
+  """,
+  since = "2.4.0")
+case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike {
+  var hsInt: OpenHashSet[Int] = _
+  var hsLong: OpenHashSet[Long] = _
+
+  def assignInt(array: ArrayData, idx: Int, resultArray: ArrayData, pos: Int): Boolean = {
+    val elem = array.getInt(idx)
+    if (!hsInt.contains(elem)) {
+      resultArray.setInt(pos, elem)
+      hsInt.add(elem)
+      true
+    } else {
+      false
+    }
+  }
+
+  def assignLong(array: ArrayData, idx: Int, resultArray: ArrayData, pos: Int): Boolean = {
+    val elem = array.getLong(idx)
+    if (!hsLong.contains(elem)) {
+      resultArray.setLong(pos, elem)
+      hsLong.add(elem)
+      true
+    } else {
+      false
+    }
+  }
+
+  def evalPrimitiveType(
+      array1: ArrayData,
+      array2: ArrayData,
+      size: Int,
+      resultArray: ArrayData,
+      isLongType: Boolean): ArrayData = {
+    // store elements into resultArray
+    var foundNullElement = false
+    var pos = 0
+    Seq(array1, array2).foreach(array => {
+      var i = 0
+      while (i < array.numElements()) {
+        if (array.isNullAt(i)) {
+          if (!foundNullElement) {
+            resultArray.setNullAt(pos)
+            pos += 1
+            foundNullElement = true
+          }
+        } else {
+          val assigned = if (!isLongType) {
+            assignInt(array, i, resultArray, pos)
+          } else {
+            assignLong(array, i, resultArray, pos)
+          }
+          if (assigned) {
+            pos += 1
+          }
+        }
+        i += 1
+      }
+    })
+    resultArray
+  }
+
+  override def nullSafeEval(input1: Any, input2: Any): Any = {
+    val array1 = input1.asInstanceOf[ArrayData]
+    val array2 = input2.asInstanceOf[ArrayData]
+
+    if (elementTypeSupportEquals) {
+      elementType match {
+        case IntegerType =>
+          // avoid boxing of primitive int array elements
+          // calculate result array size
+          val hsSize = new OpenHashSet[Int]
+          Seq(array1, array2).foreach(array => {
+            var i = 0
+            while (i < array.numElements()) {
+              if (hsSize.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+                ArraySetLi
[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21577
[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21577

> I pushed the change for that in: vanzin/spark@e6a862e

I like it, it's simpler to use task id to replace stage attempt id and task attempt id. For safety we should do it in master only after this PR is merged.
[GitHub] spark pull request #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21482#discussion_r197234189

--- Diff: python/pyspark/sql/functions.py ---
@@ -468,6 +468,18 @@ def input_file_name():
     return Column(sc._jvm.functions.input_file_name())


+@since(2.4)
+def isinf(col):
--- End diff --

@HyukjinKwon could you clarify, please?
[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/21598

> so we can't just change the default value in a feature release

Agreed. Once a particular interface and behavior is in our released public API, we effectively have a contract not to change that behavior. If we are going to provide another behavior before making a new major-number release (e.g. spark-3.0.0), then we have to provide a user configuration option to select that new behavior, and the default behavior, if a user doesn't change the configuration, must be the same as before the optional new behavior. If there is a clear, severe bug (such as data loss or corruption), only then can we consider changing the public API before making a new major-number release -- but even then we are likely to either go immediately to a new major number or to at least preserve the old, buggy behavior with a configuration option.
[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21587 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/382/ Test PASSed.
[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21587 Merged build finished. Test PASSed.
[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21598

@rxin yes we have, I think they are all listed in the [2.4 migration guide](https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#upgrading-from-spark-sql-23-to-24).

I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it.
[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21587 **[Test build #92177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92177/testReport)** for PR 21587 at commit [`72466b0`](https://github.com/apache/spark/commit/72466b0026469379ff1ccce350b538f66c0b384a).
[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21587 Merged build finished. Test FAILed.
[GitHub] spark issue #21587: [SPARK-24588][SS] streaming join should require HashClus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21587 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/381/ Test FAILed.
[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21061#discussion_r197229189

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21598 Do we have other "legacy" configs that we haven't released and can change to match this prefix? It's pretty nice to have a single prefix for stuff like this.
[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21061#discussion_r197229074

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21577 I will
[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 If you are working on this, I'll merge the other one and wait for you and continue to investigate in parallel
[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21061#discussion_r197228219

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
[GitHub] spark pull request #21598: [SPARK-24605][SQL] size(null) returns null instea...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21598#discussion_r197227370

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1314,6 +1314,13 @@ object SQLConf {
       "Other column values can be ignored during parsing even if they are malformed.")
     .booleanConf
     .createWithDefault(true)
+
+  val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
--- End diff --

I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it. It's similar to https://github.com/apache/spark/pull/21427#issuecomment-396142545, but as I replied in that PR, having a version-specific config is overkill, while `legacy` is simpler and more explicit that it will be removed in the future.
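A hedged sketch of how such a legacy flag typically gates the behavior; the doc text and wiring below are assumptions based on the `buildConf` pattern visible in the diff, not the PR's exact code:

```scala
// Hypothetical completion of the truncated entry, following SQLConf conventions.
val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
  .doc("If set to true, size(null) returns -1 (the pre-2.4 behavior); " +
    "otherwise size(null) returns null.")
  .booleanConf
  .createWithDefault(true)

// At evaluation time the flag picks between the two semantics.
def sizeOfNull(legacy: Boolean): Any = if (legacy) -1 else null
```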
[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 Ah ok, I was looking at my own version as well. There are other things we should update for v2 too: the variable names in other functions, the description in DataWriterFactory.java, etc.
[GitHub] spark pull request #21558: [SPARK-24552][SQL] Use task ID instead of attempt...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/21558#discussion_r197222759

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala ---
@@ -110,7 +108,7 @@ object DataWritingSparkTask extends Logging {
       useCommitCoordinator: Boolean): WriterCommitMessage = {
     val stageId = context.stageId()
     val partId = context.partitionId()
-    val attemptId = context.attemptNumber()
+    val attemptId = context.taskAttemptId().toInt
--- End diff --

HadoopWriteConfigUtil has the same issue: it's a public interface and uses an int for the attempt number. It seems somewhat unlikely, but it is more likely for task IDs to go over an int in Spark than in, say, MapReduce. We do have partitionId as an Int, so if partition counts get close to the Int range and you have task failures, task IDs could go over Int. Looking at our options.
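One option, as a minimal sketch: fail fast rather than silently truncate. `Math.toIntExact` throws `ArithmeticException` when a `Long` does not fit in an `Int` (assumption: a hard error is preferable to a wrapped-around ID):

```scala
// Hypothetical guard for the conversion flagged in the diff above.
def toIntTaskId(taskAttemptId: Long): Int =
  Math.toIntExact(taskAttemptId) // throws instead of wrapping past Int.MaxValue
```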
[GitHub] spark issue #21599: [SPARK-24598][SQL] Overflow on arithmetic operations ret...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21599 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92172/ Test FAILed.
[GitHub] spark issue #21599: [SPARK-24598][SQL] Overflow on arithmetic operations ret...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21599 Merged build finished. Test FAILed.
[GitHub] spark issue #21599: [SPARK-24598][SQL] Overflow on arithmetic operations ret...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21599

**[Test build #92172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92172/testReport)** for PR 21599 at commit [`9c3df7d`](https://github.com/apache/spark/commit/9c3df7d553f523581fe79a64a1bb167062728103).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 FYI, I'm preparing my own version of this PR with the remaining feedback addressed. Ryan was on paternity leave and I don't know whether he's done yet, so he may not be that responsive. This will conflict with the output commit coordinator change in any case, so one of them needs to wait (and that one is further along).
[GitHub] spark issue #21527: [SPARK-24519] Make the threshold for highly compressed m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21527 Merged build finished. Test PASSed.
[GitHub] spark issue #21527: [SPARK-24519] Make the threshold for highly compressed m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21527 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92169/ Test PASSed.
[GitHub] spark issue #21527: [SPARK-24519] Make the threshold for highly compressed m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21527 **[Test build #92169 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92169/testReport)** for PR 21527 at commit [`4cb492e`](https://github.com/apache/spark/commit/4cb492ed483c73acdd8f77aef524582ce4dcedff).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21577 So, does anyone want to do the actual merging?
[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21598 This is not a "bug" and there is no "right" behavior in APIs. It's been defined as -1 since the very beginning (when was it added?), so we can't just change the default value in a feature release.
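For readers unfamiliar with the behavior being defended: `size` on a NULL array has returned -1 rather than NULL. A minimal sketch of observing it, assuming a local Spark of the 2.x line under discussion:

```scala
import org.apache.spark.sql.SparkSession

object SizeOfNullDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("size-null").getOrCreate()
    // Long-standing default: size(NULL) yields -1, not NULL.
    spark.sql("SELECT size(cast(null AS array<int>)) AS s").show() // s = -1
    spark.stop()
  }
}
```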
[GitHub] spark issue #21565: [SPARK-24558][Core]wrong Idle Timeout value is used in c...
Github user sandeep-katta commented on the issue: https://github.com/apache/spark/pull/21565 @cloud-fan can you please review this small piece of code and merge this PR
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20944 Merged build finished. Test PASSed.
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20944 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92171/ Test PASSed.
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20944 **[Test build #92171 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92171/testReport)** for PR 20944 at commit [`7133d7a`](https://github.com/apache/spark/commit/7133d7aac3c1aaffbd8a9c4433060c4a9d488035).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21588 Let me phrase the question a different way: your title says "Make jenkins tests passed" [sic]. If you check this in, and we enable a jenkins job for hadoop 3, will it pass? I'm 100% sure the answer is no. So fix the Hive fork, then update this PR, and let's have it actually run through jenkins.
[GitHub] spark issue #21548: [SPARK-24518][CORE] Using Hadoop credential provider API...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21548 Could you update the summary so that it doesn't sound like this is an existing security issue?
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20944 Merged build finished. Test PASSed.
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20944 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92170/ Test PASSed.
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20944 **[Test build #92170 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92170/testReport)** for PR 20944 at commit [`da36564`](https://github.com/apache/spark/commit/da3656490c7863b74f6274d463b07abcd015ade5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21596 Please make sure that performance doesn't degrade after upgrading Jackson. You can check that with [JsonBenchmarks](https://github.com/apache/spark/blob/bd14da6fd5a77cc03efff193a84ffccbe892cc13/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmarks.scala).
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197191390
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
case sources.Not(pred) =>
createFilter(schema, pred).map(FilterApi.not)
+ case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 =>
--- End diff --
spark.sql.parquet.pushdown.inFilterThreshold. By default, it should be around 10. Please also check the perf. cc @jiangxb1987
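A simplified sketch of what such a conversion looks like: below the threshold, `In` becomes a chain of equality predicates OR-ed together; above it, pushdown is skipped. Here `makeEq` is a hypothetical stand-in for the per-type equality builder in `ParquetFilters`, and the threshold would come from the proposed config:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

object InToOrSketch {
  // Hypothetical helper: builds the type-appropriate Parquet equality
  // predicate for the named column.
  def makeEq(name: String, value: Any): FilterPredicate = ???

  def convertIn(name: String, values: Array[Any], threshold: Int): Option[FilterPredicate] =
    if (values.nonEmpty && values.length < threshold) {
      // IN (v1, v2, ...)  =>  eq(v1) || eq(v2) || ...
      Some(values.map(v => makeEq(name, v)).reduceLeft((l, r) => FilterApi.or(l, r)))
    } else {
      None // too many values: skip pushdown rather than build a huge predicate tree
    }
}
```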
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Merged build finished. Test FAILed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92168/ Test FAILed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16677 **[Test build #92168 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92168/testReport)** for PR 16677 at commit [`5594bf9`](https://github.com/apache/spark/commit/5594bf9f13aa83d05a433bad0fd366daabd2d034).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21605 **[Test build #92176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92176/testReport)** for PR 21605 at commit [`fb8cd54`](https://github.com/apache/spark/commit/fb8cd5466c39b69822db145d17a2e40255e74b03).
[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21605 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/380/ Test PASSed.
[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21605 Merged build finished. Test PASSed.
[GitHub] spark issue #21605: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21605 cc @cloud-fan @daniel-shields @WenboZhao
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 OK, so I created https://github.com/apache/spark/pull/21605 for the fix proposed by @daniel-shields. I'd like to leave this one open so we can continue the discussion toward a better long-term fix.
[GitHub] spark pull request #21605: [SPARK-24385][SQL] Resolve self-join condition am...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/21605 [SPARK-24385][SQL] Resolve self-join condition ambiguity for EqualNullSafe
## What changes were proposed in this pull request?
In Dataset.join we have a small hack for resolving ambiguity in the column name for self-joins. The current code supports only `EqualTo`. This PR extends the fix to `EqualNullSafe`. Credit for this PR should be given to @daniel-shields.
## How was this patch tested?
Added a unit test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-24385_2
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21605.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #21605
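The shape of the change, roughly: the disambiguation logic pattern-matches the join condition, and the match needs a second case for the null-safe form. A hedged sketch (simplified; `rewriteSides` is a hypothetical stand-in for the per-side attribute rewriting the real code performs):

```scala
import org.apache.spark.sql.catalyst.expressions.{EqualNullSafe, EqualTo, Expression}

object SelfJoinDedupSketch {
  // Simplified: treat both equality forms uniformly during self-join
  // disambiguation instead of only EqualTo.
  def dedupCondition(cond: Expression)(rewriteSides: Expression => Expression): Expression =
    cond.transform {
      case e: EqualTo       => rewriteSides(e)
      case e: EqualNullSafe => rewriteSides(e)
    }
}
```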
[GitHub] spark issue #21282: [SPARK-23934][SQL] Adding map_from_entries function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21282 **[Test build #92175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92175/testReport)** for PR 21282 at commit [`4eaedc5`](https://github.com/apache/spark/commit/4eaedc50f92a3dd1ee2100fbbd5ac951344ece75).
[GitHub] spark pull request #21548: [SPARK-24518][CORE] Using Hadoop credential provi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/21548#discussion_r197182301
--- Diff: core/src/main/scala/org/apache/spark/SSLOptions.scala ---
@@ -179,9 +185,11 @@ private[spark] object SSLOptions extends Logging {
.orElse(defaults.flatMap(_.keyStore))
val keyStorePassword = conf.getWithSubstitution(s"$ns.keyStorePassword")
+ .orElse(Option(hadoopConf.getPassword(s"$ns.keyStorePassword")).map(new String(_)))
--- End diff --
`new String` takes a charset. (In fact the constructor you're calling should be deprecated...)
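The general rule behind the review comment: when converting bytes to a `String`, always pass an explicit charset, because the no-charset overloads fall back to the platform default. A small sketch, deliberately independent of the Hadoop API in the diff:

```scala
import java.nio.charset.StandardCharsets

object CharsetDemo {
  def main(args: Array[String]): Unit = {
    val bytes: Array[Byte] = "secret".getBytes(StandardCharsets.UTF_8)
    val risky = new String(bytes)                          // platform-default charset
    val safe  = new String(bytes, StandardCharsets.UTF_8)  // explicit charset
    println(risky == safe) // true on UTF-8 platforms, not guaranteed elsewhere
  }
}
```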
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user WenboZhao commented on the issue: https://github.com/apache/spark/pull/21449 I like the proposal by @daniel-shields. If we can get it fixed soon, we will be able to make the Spark 2.3.2 release.
[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/21257 Some overall thoughts:
* I think this only happens on a successful job commit, not on abort. Is that the desired behavior?
* If something goes wrong here, is failing the entire job the correct action? If the deletes were happening earlier, then yes, the job would obviously fail. But now the core work has taken place and it's just cleanup that is failing, which could be down to permissions, a transient network issue, etc.
I'll have to look a bit closer at what happens in committer cleanups right now, though as they are focused on rm -f $dest/__temporary/$jobAttempt, they are less worried about failures there, as it shouldn't be changing any public datasets.
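One alternative implicit in the second bullet is best-effort cleanup: log and continue rather than fail a job whose output has already been committed. A sketch of that option (not what the PR does; just illustrating the trade-off):

```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}
import org.slf4j.LoggerFactory

object BestEffortCleanup {
  private val log = LoggerFactory.getLogger(getClass)

  // Warn on cleanup failure instead of failing the whole job: the core work
  // has committed, and the failure may be transient (permissions, network, etc.).
  def cleanup(fs: FileSystem, dir: Path): Unit =
    try {
      if (!fs.delete(dir, true)) log.warn(s"Could not delete $dir during cleanup")
    } catch {
      case e: IOException => log.warn(s"Exception while cleaning up $dir", e)
    }
}
```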
[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21602 **[Test build #92174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92174/testReport)** for PR 21602 at commit [`377f213`](https://github.com/apache/spark/commit/377f2134e6b4990b4c1d080e5fd5119fe808e057).
[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21602 Merged build finished. Test PASSed.
[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21602 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/379/ Test PASSed.
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r197177156
--- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
@@ -235,4 +244,41 @@ class HadoopMapReduceCommitProtocol(
tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
}
}
+
+ /**
+ * now just record the file to be delete
+ */
+ override def deleteWithJob(fs: FileSystem, path: Path,
--- End diff --
No need to worry about concurrent access here, correct?
[GitHub] spark issue #21602: [SPARK-24613][SQL] Cache with UDF could not be matched w...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21602 retest this please
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r197176461
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
}
// first clear the path determined by the static partition keys (e.g. /table/foo=1)
val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
- throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+if (fs.exists(staticPrefixPath)) {
+ if (staticPartitionPrefix.isEmpty && outputCheck) {
+// input contain output, only delete output sub files when job commit
+ val files = fs.listFiles(staticPrefixPath, false)
+ while (files.hasNext) {
+val file = files.next()
+if (!committer.deleteWithJob(fs, file.getPath, false)) {
+ throw new IOException(s"Unable to clear output " +
+s"directory ${file.getPath} prior to writing to it")
+}
+ }
+ } else {
+if (!committer.deleteWithJob(fs, staticPrefixPath, true)) {
+ throw new IOException(s"Unable to clear output " +
--- End diff --
again, hard to see how this exception path would be reached.
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r197176292
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
}
// first clear the path determined by the static partition keys (e.g. /table/foo=1)
val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
- throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+if (fs.exists(staticPrefixPath)) {
+ if (staticPartitionPrefix.isEmpty && outputCheck) {
+// input contain output, only delete output sub files when job commit
+ val files = fs.listFiles(staticPrefixPath, false)
--- End diff --
If there are a lot of files here, you've gone from a directory delete, which was O(1) on a filesystem and probably O(descendants) on an object store, to an O(children) operation on a filesystem and an O(children * descendants(child)) operation on an object store. Not significant for a small number of files, but potentially expensive. Why do the iteration at all?
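The two costs being compared, reduced to their `FileSystem` calls (a sketch; `staticPrefixPath` as in the quoted diff):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

object DeleteCostSketch {
  // One recursive call: roughly O(1) metadata work on an HDFS-like filesystem.
  def deleteWhole(fs: FileSystem, staticPrefixPath: Path): Unit =
    fs.delete(staticPrefixPath, true)

  // One round trip per child, as in the patch under review: O(children) calls,
  // each of which an object store may implement as O(descendants) work.
  def deletePerChild(fs: FileSystem, staticPrefixPath: Path): Unit = {
    val files = fs.listFiles(staticPrefixPath, false)
    while (files.hasNext) {
      fs.delete(files.next().getPath, false)
    }
  }
}
```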
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r197174835
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
}
// first clear the path determined by the static partition keys (e.g. /table/foo=1)
val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
- throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+if (fs.exists(staticPrefixPath)) {
+ if (staticPartitionPrefix.isEmpty && outputCheck) {
+// input contain output, only delete output sub files when job commit
+ val files = fs.listFiles(staticPrefixPath, false)
+ while (files.hasNext) {
+val file = files.next()
+if (!committer.deleteWithJob(fs, file.getPath, false)) {
+ throw new IOException(s"Unable to clear output " +
--- End diff --
As `committer.deleteWithJob()` returns true in the base class, that check won't do much, at least not with the default impl. Probably better just to have `deleteWithJob()` return Unit and raise an exception on a delete failure. Given that `delete()` is required to guarantee the dest doesn't exist once it returns, I don't think callers need to do any checks at all.
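A sketch of the suggested signature change: return `Unit` and raise inside the method, so a failed delete cannot be silently ignored. This is a hypothetical shape, not the PR's code; note that Hadoop's `FileSystem.delete` also returns false when the path never existed, which this version distinguishes with an `exists` check:

```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

object FailFastDeleteSketch {
  // Fail fast inside the protocol method instead of returning a Boolean
  // that every caller must remember to check.
  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Unit = {
    if (!fs.delete(path, recursive) && fs.exists(path)) {
      throw new IOException(s"Unable to delete $path")
    }
  }
}
```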