[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197347130

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ---

@@ -65,13 +65,38 @@ class JDBCOptions(
   // Required parameters
   // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.")
-  require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.")
+
   // a JDBC URL
   val url = parameters(JDBC_URL)
-  // name of table
-  val table = parameters(JDBC_TABLE_NAME)
+  val tableName = parameters.get(JDBC_TABLE_NAME)
+  val query = parameters.get(JDBC_QUERY_STRING)
+  // Following two conditions make sure that :
+  //   1. One of the option (dbtable or query) must be specified.
+  //   2. Both of them can not be specified at the same time as they are conflicting in nature.
+  require(
+    tableName.isDefined || query.isDefined,
+    s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required."
+  )
+
+  require(
+    !(tableName.isDefined && query.isDefined),
+    s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified."
+  )
+
+  // table name or a table expression.
+  val tableOrQuery = tableName.map(_.trim).getOrElse {
+    // We have ensured in the code above that either dbtable or query is specified.
+    query.get match {
+      case subQuery if subQuery.nonEmpty => s"(${subQuery}) spark_gen_${curId.getAndIncrement()}"
+      case subQuery => subQuery
+    }
+  }
+
+  require(tableOrQuery.nonEmpty,
+    s"Empty string is not allowed in either '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' options"
+  )
+
-

--- End diff --

nit: revert this line

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
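For reference, the dbtable/query mutual-exclusion logic in the diff above can be sketched as a standalone method. This is a hedged, Spark-free model: the option names mirror the patch, but the exception type and the fixed generated alias (`spark_gen_0` instead of the patch's `curId.getAndIncrement()`) are illustrative assumptions.

```java
import java.util.Map;

// Hypothetical, Spark-free model of the validation added in JDBCOptions.
public class JdbcOptionsModel {
    public static String resolveTableOrQuery(Map<String, String> parameters) {
        String tableName = parameters.get("dbtable");
        String query = parameters.get("query");
        // 1. One of the options (dbtable or query) must be specified.
        if (tableName == null && query == null) {
            throw new IllegalArgumentException("Option 'dbtable' or 'query' is required.");
        }
        // 2. Both cannot be specified at the same time; they conflict.
        if (tableName != null && query != null) {
            throw new IllegalArgumentException("Both 'dbtable' and 'query' can not be specified.");
        }
        String tableOrQuery;
        if (tableName != null) {
            tableOrQuery = tableName.trim();
        } else {
            // A non-empty query is wrapped as a derived table with a generated
            // alias, so it can be used anywhere a table name is expected.
            tableOrQuery = query.isEmpty() ? query : "(" + query + ") spark_gen_0";
        }
        if (tableOrQuery.isEmpty()) {
            throw new IllegalArgumentException(
                "Empty string is not allowed in either 'dbtable' or 'query' options");
        }
        return tableOrQuery;
    }
}
```

With this shape, `resolveTableOrQuery(Map.of("query", "select 1"))` yields a parenthesized subquery with an alias, while passing both options or neither fails fast.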
[GitHub] spark issue #21061: [SPARK-23914][SQL] Add array_union function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21061 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92197/ Test FAILed.
[GitHub] spark issue #21061: [SPARK-23914][SQL] Add array_union function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21061 Merged build finished. Test FAILed.
[GitHub] spark issue #21061: [SPARK-23914][SQL] Add array_union function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21061 **[Test build #92197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92197/testReport)** for PR 21061 at commit [`195f3bd`](https://github.com/apache/spark/commit/195f3bd6b47da19b27cd0c8140bcd9aa6a063843).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve Analyze Table command
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21608 Does this PR improve actual performance numbers? (My question is: is the calculation a bottleneck?)
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92192/ Test FAILed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Merged build finished. Test FAILed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21606 **[Test build #92192 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92192/testReport)** for PR 21606 at commit [`a16d9f9`](https://github.com/apache/spark/commit/a16d9f907b3ce0078da72b7e7bcc56e187cbc8f9).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21482 I have no more comments except the one above.
[GitHub] spark pull request #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21482#discussion_r197340906

--- Diff: python/pyspark/sql/functions.py ---

@@ -468,6 +468,18 @@ def input_file_name():
     return Column(sc._jvm.functions.input_file_name())
+@since(2.4)
+def isinf(col):

--- End diff --

Yes, please, because I see it's exposed in Column.scala.
[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21594 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92194/ Test PASSed.
[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21594 Merged build finished. Test PASSed.
[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21594 **[Test build #92194 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92194/testReport)** for PR 21594 at commit [`2f00f2f`](https://github.com/apache/spark/commit/2f00f2fe0e1cf9a0d44285aab306ed55bd176d9c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21603 **[Test build #92198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92198/testReport)** for PR 21603 at commit [`b9b3160`](https://github.com/apache/spark/commit/b9b3160061ef1e17ae32599ed9fbcfd44b0565b4).
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21603 Merged build finished. Test PASSed.
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21603 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/399/ Test PASSed.
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197338867

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---

@@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
+      case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 =>

--- End diff --

It seems that the push-down performance is better when the threshold is less than `300`: https://user-images.githubusercontent.com/5399861/41757743-7e411532-7616-11e8-8844-45132c50c535.png

The code:
```scala
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  import testImplicits._
  withTempPath { path =>
    val total = 1000
    (0 to total).toDF().coalesce(1)
      .write.option("parquet.block.size", 512)
      .parquet(path.getAbsolutePath)
    val df = spark.read.parquet(path.getAbsolutePath)
    // scalastyle:off println
    var lastSize = -1
    var i = 16000
    while (i < total) {
      val filter = Range(0, total).filter(_ % i == 0)
      i += 100
      if (lastSize != filter.size) {
        if (lastSize == -1) println(s"start size: ${filter.size}")
        lastSize = filter.size
        sql("set spark.sql.parquet.pushdown.inFilterThreshold=100")
        val begin1 = System.currentTimeMillis()
        df.where(s"id in(${filter.mkString(",")})").count()
        val end1 = System.currentTimeMillis()
        val time1 = end1 - begin1
        sql("set spark.sql.parquet.pushdown.inFilterThreshold=10")
        val begin2 = System.currentTimeMillis()
        df.where(s"id in(${filter.mkString(",")})").count()
        val end2 = System.currentTimeMillis()
        val time2 = end2 - begin2
        if (time1 <= time2) println(s"Max threshold: $lastSize")
      }
    }
  }
}
```
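The rewrite being tuned in this thread can be modeled without Spark as follows. This is a hedged sketch: the `Filter`/`EqualTo`/`Or` shapes merely mimic `org.apache.spark.sql.sources`, and the threshold parameter corresponds to the `20` in the diff (which the benchmark above argues could be nearer 300).

```java
import java.util.List;

// Spark-free sketch: an IN predicate becomes a chain of OR'ed equality
// filters, but only while the value count stays below the threshold.
public class InPushDown {
    interface Filter {}
    record EqualTo(String attribute, Object value) implements Filter {}
    record Or(Filter left, Filter right) implements Filter {}

    /** Returns the OR chain, or null when push-down should be skipped. */
    public static Filter pushDownIn(String name, List<?> values, int threshold) {
        if (values.isEmpty() || values.size() >= threshold) {
            return null; // too many values: evaluate IN after the scan instead
        }
        Filter result = new EqualTo(name, values.get(0));
        for (int i = 1; i < values.size(); i++) {
            result = new Or(result, new EqualTo(name, values.get(i)));
        }
        return result;
    }
}
```

The design trade-off the benchmark is probing: each extra `Or`/`EqualTo` pair adds per-row-group evaluation cost in Parquet, so beyond some value count it is cheaper to scan and filter afterwards.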
[GitHub] spark issue #21610: Updates to LICENSE and NOTICE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21610 Can one of the admins verify this patch?
[GitHub] spark pull request #21610: Updates to LICENSE and NOTICE
GitHub user justinmclean opened a pull request:

    https://github.com/apache/spark/pull/21610

    Updates to LICENSE and NOTICE

## What changes were proposed in this pull request?

LICENSE and NOTICE changes as per ASF policy

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/justinmclean/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21610.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21610

commit b9d12d700b9cb83402e42f264f21bca090e0d1e3
Author: Justin Mclean
Date: 2018-06-22T04:20:59Z

    Updates to LICENSE and NOTICE
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21609 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92196/ Test PASSed.
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21609 Merged build finished. Test PASSed.
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21609 **[Test build #92196 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92196/testReport)** for PR 21609 at commit [`3040763`](https://github.com/apache/spark/commit/3040763e51c8d32309f2dc38ce8b9fcc740ceb3d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197336527

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---

@@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
+      case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 =>

--- End diff --

+1
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92195/ Test FAILed.
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21607 **[Test build #92195 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92195/testReport)** for PR 21607 at commit [`9d7e6ea`](https://github.com/apache/spark/commit/9d7e6eafff3daa519f7fda0b1f219f74d499874d).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Merged build finished. Test FAILed.
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Merged build finished. Test PASSed.
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92193/ Test PASSed.
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21607 **[Test build #92193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92193/testReport)** for PR 21607 at commit [`0520d60`](https://github.com/apache/spark/commit/0520d60b44987369fa62d7237427cb0cf022ed41).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92190/ Test PASSed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Merged build finished. Test PASSed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21606 **[Test build #92190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92190/testReport)** for PR 21606 at commit [`227d513`](https://github.com/apache/spark/commit/227d513ade176fd56f7e6d75a16deb6c654982db).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92189/ Test PASSed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Merged build finished. Test PASSed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21606 **[Test build #92189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92189/testReport)** for PR 21606 at commit [`5efaae7`](https://github.com/apache/spark/commit/5efaae74bf340fed4223b5209bed63475cc35516).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92191/ Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Merged build finished. Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21320 **[Test build #92191 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92191/testReport)** for PR 21320 at commit [`a255bcb`](https://github.com/apache/spark/commit/a255bcb4c480d3c97f7ff0590bca0c20de034a31).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user zzcclp commented on the issue: https://github.com/apache/spark/pull/21609 Can this PR be merged ASAP? Currently there is an error on branch-2.2.
[GitHub] spark issue #21061: [SPARK-23914][SQL] Add array_union function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21061 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/398/ Test PASSed.
[GitHub] spark issue #21061: [SPARK-23914][SQL] Add array_union function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21061 Merged build finished. Test PASSed.
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 Yup, will fix the hive fork thing and be back.
[GitHub] spark pull request #21570: [SPARK-24564][TEST] Add test suite for RecordBina...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21570#discussion_r197328626

--- Diff: sql/core/src/test/java/test/org/apache/spark/sql/execution/sort/RecordBinaryComparatorSuite.java ---

@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package test.org.apache.spark.sql.execution.sort;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.memory.TaskMemoryManager;

--- End diff --

cc @jiangxb1987
[GitHub] spark issue #21061: [SPARK-23914][SQL] Add array_union function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21061 **[Test build #92197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92197/testReport)** for PR 21061 at commit [`195f3bd`](https://github.com/apache/spark/commit/195f3bd6b47da19b27cd0c8140bcd9aa6a063843).
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/21588 @HyukjinKwon, I'm in favor of @vanzin's comment: we should fix things first and then come back to this one.
[GitHub] spark pull request #21548: [SPARK-24518][CORE] Using Hadoop credential provi...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/21548#discussion_r197327620

--- Diff: core/src/main/scala/org/apache/spark/SSLOptions.scala ---

@@ -179,9 +185,11 @@ private[spark] object SSLOptions extends Logging {
       .orElse(defaults.flatMap(_.keyStore))

     val keyStorePassword = conf.getWithSubstitution(s"$ns.keyStorePassword")
+      .orElse(Option(hadoopConf.getPassword(s"$ns.keyStorePassword")).map(new String(_)))

--- End diff --

Hi @vanzin, I checked the jdk8 doc again; I can't find a String constructor that takes both a char array and a charset as parameters.
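For illustration, a minimal sketch of the point under discussion: `java.lang.String` has a `new String(char[])` constructor but no `(char[], Charset)` overload, so the `char[]` that Hadoop's `Configuration.getPassword(...)` returns is wrapped directly, as the diff does. The helper name below is hypothetical.

```java
// Hypothetical helper mirroring the null-safe wrap in the diff:
// Option(hadoopConf.getPassword(...)).map(new String(_))
public class PasswordUtil {
    public static String passwordToString(char[] chars) {
        // new String(char[]) copies the characters; no charset is involved
        // because the input is already decoded characters, not bytes.
        return (chars == null) ? null : new String(chars);
    }
}
```

A charset would only matter for a `byte[]` source (`new String(byte[], Charset)` does exist); `char[]` is already character data.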
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92187/ Test PASSed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Merged build finished. Test PASSed.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21606 **[Test build #92187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92187/testReport)** for PR 21606 at commit [`c884f4f`](https://github.com/apache/spark/commit/c884f4f27199b3c91f56ba0042b42d09bc243883).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #21598: [SPARK-24605][SQL] size(null) returns null instea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21598#discussion_r197326162

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

@@ -1314,6 +1314,13 @@ object SQLConf {
         "Other column values can be ignored during parsing even if they are malformed.")
       .booleanConf
       .createWithDefault(true)
+
+  val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")

--- End diff --

That's basically the same except that the postfix includes a specific version, which was just a rough idea.
[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...
Github user zzcclp commented on the issue: https://github.com/apache/spark/pull/21577 @vanzin @tgravescs, after merging this PR into branch-2.2, there is an error "stageAttemptNumber is not a member of org.apache.spark.TaskContext" in SparkHadoopMapRedUtil; I think PR-20082 needs to be merged first.
[GitHub] spark issue #21598: [SPARK-24605][SQL] size(null) returns null instead of -1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21598 My assumption was that the PR and JIRA claim that it's the right behaviour, as I said multiple times. If there's no such thing, there should be of course no need to argue about the default value, as I said above.
[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21542 Even after we stopped forking SpotBugs, the same error occurred. @HyukjinKwon, do you have any ideas? I would appreciate your thoughts.
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92188/ Test PASSed.
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Merged build finished. Test PASSed.
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21607 **[Test build #92188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92188/testReport)** for PR 21607 at commit [`d1f3219`](https://github.com/apache/spark/commit/d1f3219a58f4dc4f1e65a793c6d01572b25a609e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 Will try to fix it then. We can just enable it back later: if we want to support those Hive versions in Hadoop 3, we could simply enable them back with some fixes at that time. Adding that support sounds like an incremental improvement.
[GitHub] spark pull request #21061: [SPARK-23914][SQL] Add array_union function
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21061#discussion_r197319579 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -2355,3 +2355,347 @@ case class ArrayRemove(left: Expression, right: Expression) override def prettyName: String = "array_remove" } + +object ArraySetLike { + def useGenericArrayData(elementSize: Int, length: Int): Boolean = { +// Use the same calculation in UnsafeArrayData.fromPrimitiveArray() +val headerInBytes = UnsafeArrayData.calculateHeaderPortionInBytes(length) +val valueRegionInBytes = elementSize.toLong * length +val totalSizeInLongs = (headerInBytes + valueRegionInBytes + 7) / 8 +totalSizeInLongs > Integer.MAX_VALUE / 8 + } + + def throwUnionLengthOverflowException(length: Int): Unit = { +throw new RuntimeException(s"Unsuccessful try to union arrays with $length " + + s"elements due to exceeding the array size limit " + + s"${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.") + } +} + + +abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast { + override def dataType: DataType = left.dataType + + override def checkInputDataTypes(): TypeCheckResult = { +val typeCheckResult = super.checkInputDataTypes() +if (typeCheckResult.isSuccess) { + TypeUtils.checkForOrderingExpr(dataType.asInstanceOf[ArrayType].elementType, +s"function $prettyName") +} else { + typeCheckResult +} + } + + @transient protected lazy val ordering: Ordering[Any] = +TypeUtils.getInterpretedOrdering(elementType) + + @transient protected lazy val elementTypeSupportEquals = elementType match { +case BinaryType => false +case _: AtomicType => true +case _ => false + } +} + +/** + * Returns an array of the elements in the union of x and y, without duplicates + */ +@ExpressionDescription( + usage = """ +_FUNC_(array1, array2) - Returns an array of the elements in the union of array1 and array2, + without duplicates. 
+ """, + examples = """ +Examples: + > SELECT _FUNC_(array(1, 2, 3), array(1, 3, 5)); + array(1, 2, 3, 5) + """, + since = "2.4.0") +case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike { + var hsInt: OpenHashSet[Int] = _ + var hsLong: OpenHashSet[Long] = _ + + def assignInt(array: ArrayData, idx: Int, resultArray: ArrayData, pos: Int): Boolean = { +val elem = array.getInt(idx) +if (!hsInt.contains(elem)) { + resultArray.setInt(pos, elem) + hsInt.add(elem) + true +} else { + false +} + } + + def assignLong(array: ArrayData, idx: Int, resultArray: ArrayData, pos: Int): Boolean = { +val elem = array.getLong(idx) +if (!hsLong.contains(elem)) { + resultArray.setLong(pos, elem) + hsLong.add(elem) + true +} else { + false +} + } + + def evalPrimitiveType( + array1: ArrayData, + array2: ArrayData, + size: Int, + resultArray: ArrayData, + isLongType: Boolean): ArrayData = { +// store elements into resultArray +var foundNullElement = false +var pos = 0 +Seq(array1, array2).foreach(array => { + var i = 0 + while (i < array.numElements()) { +if (array.isNullAt(i)) { + if (!foundNullElement) { +resultArray.setNullAt(pos) +pos += 1 +foundNullElement = true + } +} else { + val assigned = if (!isLongType) { +assignInt(array, i, resultArray, pos) + } else { +assignLong(array, i, resultArray, pos) + } + if (assigned) { +pos += 1 + } +} +i += 1 + } +}) +resultArray + } + + override def nullSafeEval(input1: Any, input2: Any): Any = { +val array1 = input1.asInstanceOf[ArrayData] +val array2 = input2.asInstanceOf[ArrayData] + +if (elementTypeSupportEquals) { + elementType match { +case IntegerType => + // avoid boxing of primitive int array elements + // calculate result array size + val hsSize = new OpenHashSet[Int] + Seq(array1, array2).foreach(array => { +var i = 0 +while (i < array.numElements()) { + if (hsSize.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) { +
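The `useGenericArrayData` helper at the top of the quoted diff decides when the union result is too large for `UnsafeArrayData`'s primitive layout. Below is a minimal, self-contained sketch of that arithmetic; the header formula is an assumption mirroring `UnsafeArrayData.calculateHeaderPortionInBytes` (8 bytes for the element count plus a null bitmap rounded up to 8-byte words), not a verbatim copy of the Spark source:

```scala
// Sketch of the overflow gate in the quoted diff (not the exact Spark source).
// Header layout assumption: 8 bytes for numElements, then 1 null bit per
// element, rounded up to whole 8-byte words.
def calculateHeaderPortionInBytes(numElements: Int): Long =
  8L + ((numElements.toLong + 63) / 64) * 8L

// True when the backing array for `length` primitive elements of
// `elementSize` bytes would exceed the rounded array-size limit, in which
// case the code falls back to GenericArrayData instead of UnsafeArrayData.
def useGenericArrayData(elementSize: Int, length: Int): Boolean = {
  val headerInBytes = calculateHeaderPortionInBytes(length)
  val valueRegionInBytes = elementSize.toLong * length
  val totalSizeInLongs = (headerInBytes + valueRegionInBytes + 7) / 8
  totalSizeInLongs > Integer.MAX_VALUE / 8
}
```

Small arrays stay on the fast unsafe path; only arrays whose total size in 8-byte words would exceed `Integer.MAX_VALUE / 8` take the generic fallback.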
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21609 +1 pending tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/397/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21609 **[Test build #92196 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92196/testReport)** for PR 21609 at commit [`3040763`](https://github.com/apache/spark/commit/3040763e51c8d32309f2dc38ce8b9fcc740ceb3d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21609 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/396/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21607 **[Test build #92195 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92195/testReport)** for PR 21607 at commit [`9d7e6ea`](https://github.com/apache/spark/commit/9d7e6eafff3daa519f7fda0b1f219f74d499874d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21609 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21588 > The tests were passed in this PR builder Against your private build of the Hive stuff. Again, fix that and this will become a lot easier to discuss. I'm also against disabling these tests without a proper discussion of what that means, as I've said multiple times. If we want to support those Hive versions in Hadoop 3, then this is the wrong change.
[GitHub] spark issue #21609: [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21609 Backport to branch-2.2; the only changes were to MimaExcludes and a test file that had one more call to TaskContext. @vanzin
[GitHub] spark pull request #21609: [SPARK-22897][CORE] Expose stageAttemptId in Task...
GitHub user tgravescs opened a pull request: https://github.com/apache/spark/pull/21609 [SPARK-22897][CORE] Expose stageAttemptId in TaskContext stageAttemptId added in TaskContext and corresponding construction modification Added a new test in TaskContextSuite, two cases are tested: 1. Normal case without failure 2. Exception case with resubmitted stages Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897) Author: Xianjin YE Closes #20082 from advancedxy/SPARK-22897. Conflicts: project/MimaExcludes.scala You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgravescs/spark SPARK-22897 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21609.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21609 commit 4bc8d2805949b6b9d4d06ff4ad0493d9b33c7063 Author: Xianjin YE Date: 2018-01-02T15:30:38Z [SPARK-22897][CORE] Expose stageAttemptId in TaskContext stageAttemptId added in TaskContext and corresponding construction modification Added a new test in TaskContextSuite, two cases are tested: 1. Normal case without failure 2. Exception case with resubmitted stages Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897) Author: Xianjin YE Closes #20082 from advancedxy/SPARK-22897. Conflicts: project/MimaExcludes.scala
[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21594 **[Test build #92194 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92194/testReport)** for PR 21594 at commit [`2f00f2f`](https://github.com/apache/spark/commit/2f00f2fe0e1cf9a0d44285aab306ed55bd176d9c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21594 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21594 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/395/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/394/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/393/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21607 **[Test build #92193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92193/testReport)** for PR 21607 at commit [`0520d60`](https://github.com/apache/spark/commit/0520d60b44987369fa62d7237427cb0cf022ed41). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21607 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 The tests passed in this PR builder. The only hack I used is that I landed a one-liner fix to an artifact so it could be used in this PR; that fix is already in Hive, and is proposed in Hive's fork, where it is blocked for non-technical reasons. I am working on getting this through. Okay, if you think it should be blocked, let me get that through first. I am not dropping it. Isn't this what we already cover? I believe this is the most minimal and conservative fix to make Hadoop 3 work within Spark, since we already added it. FWIW, we haven't documented the Hadoop 3 profile yet, so my impression is that it's still in progress.
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21606 **[Test build #92192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92192/testReport)** for PR 21606 at commit [`a16d9f9`](https://github.com/apache/spark/commit/a16d9f907b3ce0078da72b7e7bcc56e187cbc8f9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21606: [SPARK-24552][core][SQL] Use task ID instead of a...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/21606#discussion_r197316565 --- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala --- @@ -76,13 +76,29 @@ object SparkHadoopWriter extends Logging { // Try to write all RDD partitions as a Hadoop OutputFormat. try { val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => { +// Generate a positive integer task ID that is unique for the current stage. This makes a +// few assumptions: +// - the task ID is always positive +// - stages cannot have more than Int.MaxValue tasks +// - the sum of task counts of all active stages doesn't exceed Int.MaxValue +// +// The first two are currently the case in Spark, while the last one is very unlikely to +// occur. If it does, two task IDs on a single stage could have a clashing integer value, +// which could lead to code that generates clashing file names for different tasks. Still, +// if the commit coordinator is enabled, only one task would be allowed to commit. --- End diff -- Ok, I'll use that. I think Spark might fail everything before you even go that high in attempt numbers anyway...
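For illustration, the ID scheme described in the quoted comment reduces to folding Spark's application-unique `Long` task attempt ID into a non-negative `Int`. The helper name below is hypothetical; this is a sketch of the idea under the comment's stated assumptions, not the PR's exact code:

```scala
// Hypothetical sketch: fold a Long task attempt ID into an Int for APIs
// (e.g. Hadoop committers) that only accept an int. A collision requires two
// live task IDs exactly Int.MaxValue apart -- the "very unlikely" case the
// quoted comment discusses.
def intTaskId(taskAttemptId: Long): Int = {
  require(taskAttemptId >= 0, "Spark task IDs are always non-negative")
  (taskAttemptId % Int.MaxValue).toInt
}
```

Even in the collision case, the comment notes the output commit coordinator still allows only one task to commit, so the clash costs correctness only when the coordinator is disabled.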
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21588 I already explained my view of why I don't think this should get in, in its current form. Passing tests in someone's private environment, for me, is not a worthy goal. You say the fix is needed, but I'm not even sure this is the right fix. You're dropping support for a bunch of Hive versions, effectively. Is that what we want? If it is, you need to properly document that, and fix places where you need a proper error message so users are not confused. If it's not, you need to find a solution to that problem. And for that it would be easier if you could actually test your change here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21607: branch-2.1: backport SPARK-24589 and SPARK-22897
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/21607#discussion_r197315441 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -97,48 +102,48 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean) } /** - * Called by the DAGScheduler when a stage starts. + * Called by the DAGScheduler when a stage starts. Initializes the stage's state if it hasn't + * yet been initialized. * * @param stage the stage id. * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e. * the maximum possible value of `context.partitionId`). */ - private[scheduler] def stageStart( - stage: StageId, - maxPartitionId: Int): Unit = { -val arr = new Array[TaskAttemptNumber](maxPartitionId + 1) -java.util.Arrays.fill(arr, NO_AUTHORIZED_COMMITTER) + private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized { +val arr = Array.fill[TaskIdentifier](maxPartitionId + 1)(null) synchronized { --- End diff -- we have 2 nested synchronized --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
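The nit stands because JVM monitors are reentrant: the inner `synchronized` re-enters a lock the method already holds, so it adds no protection and no deadlock, just noise. A toy illustration (hypothetical class, not the Spark code under review):

```scala
// Reentrant locking: the inner synchronized block re-enters the monitor the
// method already holds, so it neither deadlocks nor adds any protection --
// hence the review nit to drop one of the two.
class Counter {
  private var n = 0
  def inc(): Int = synchronized {
    synchronized { // redundant: same monitor, already held by this thread
      n += 1
      n
    }
  }
}
```

The fix suggested by the review is simply to keep one `synchronized` (on the method) and delete the other.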
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 I at least checked manually that this passes with that fix to the fork. It fixes everything else that can be fixed on the Spark side. To be honest, I still wonder why this should be blocked. It can't be run via Jenkins, and I accept that the change could be blocked for that reason, but this fix is needed anyway and can be unblocked later. If something is needed, I can just review and merge it.
[GitHub] spark issue #21247: [SPARK-24190][SQL] Allow saving of JSON files in UTF-16 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21247 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92183/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21247: [SPARK-24190][SQL] Allow saving of JSON files in UTF-16 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21247 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21247: [SPARK-24190][SQL] Allow saving of JSON files in UTF-16 ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21247 **[Test build #92183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92183/testReport)** for PR 21247 at commit [`ca1b243`](https://github.com/apache/spark/commit/ca1b24322edd119d1e15b39f79bb15dd22cae482). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class ForeachBatchFunction(object):` * `case class ArrayDistinct(child: Expression)` * `class PythonForeachWriter(func: PythonFunction, schema: StructType)` * ` class UnsafeRowBuffer(taskMemoryManager: TaskMemoryManager, tempDir: File, numFields: Int)` * `trait MemorySinkBase extends BaseStreamingSink with Logging ` * `class MemorySink(val schema: StructType, outputMode: OutputMode, options: DataSourceOptions)` * `class ForeachBatchSink[T](batchWriter: (Dataset[T], Long) => Unit, encoder: ExpressionEncoder[T])` * `trait PythonForeachBatchFunction ` * `case class ForeachWriterProvider[T](` * `case class ForeachWriterFactory[T](` * `class ForeachDataWriter[T](` * `class MemoryWriter(` * `class MemoryStreamWriter(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user maryannxue commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197314689 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala --- @@ -801,4 +800,67 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext } assert(cachedData.collect === Seq(1001)) } + + test("SPARK-24596 Non-cascading Cache Invalidation - uncache temporary view") { +withView("t1", "t2") { --- End diff -- Yes.. good catch! A mistake caused by copy-paste. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user maryannxue commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197314556 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala --- @@ -143,9 +153,57 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits df.count() df2.cache() -val plan = df2.queryExecution.withCachedData -assert(plan.isInstanceOf[InMemoryRelation]) -val internalPlan = plan.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan - assert(internalPlan.find(_.isInstanceOf[InMemoryTableScanExec]).isDefined) +assertCacheDependency(df2) + } + + test("SPARK-24596 Non-cascading Cache Invalidation") { +val df = Seq(("a", 1), ("b", 2)).toDF("s", "i") +val df2 = df.filter('i > 1) +val df3 = df.filter('i < 2) + +df2.cache() +df.cache() +df.count() +df3.cache() + +df.unpersist() + +// df un-cached; df2 and df3's cache plan re-compiled +assert(df.storageLevel == StorageLevel.NONE) +assertCacheDependency(df2, 0) +assertCacheDependency(df3, 0) + } + + test("SPARK-24596 Non-cascading Cache Invalidation - verify cached data reuse") { +val expensiveUDF = udf({ x: Int => Thread.sleep(5000); x }) +val df = spark.range(0, 10).toDF("a") +val df1 = df.withColumn("b", expensiveUDF($"a")) +val df2 = df1.groupBy('a).agg(sum('b)) +val df3 = df.agg(sum('a)) + +df1.cache() +df2.cache() +df2.collect() +df3.cache() + +assertCacheDependency(df2) + +df1.unpersist(blocking = true) + +// df1 un-cached; df2's cache plan re-compiled +assert(df1.storageLevel == StorageLevel.NONE) +assertCacheDependency(df1.groupBy('a).agg(sum('b)), 0) + +val df4 = df1.groupBy('a).agg(sum('b)).select("sum(b)") +assertCached(df4) +// reuse loaded cache +failAfter(3 seconds) { + df4.collect() +} + +val df5 = df.agg(sum('a)).filter($"sum(a)" > 1) +assertCached(df5) +// first time use, load cache +df5.collect() --- End diff -- We just need to prove the new InMemoryRelation works alright for building cache (since the plan has been re-compiled) ... 
maybe we should check the result, though. Plus, I deliberately made this dataframe not dependent on the UDF so it can finish quickly.
[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21192 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92184/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21192 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21192 **[Test build #92184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92184/testReport)** for PR 21192 at commit [`eab96b4`](https://github.com/apache/spark/commit/eab96b4ed078263d8eb1df6b1204c007f6b4be4a). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class ForeachBatchFunction(object):` * `case class ArrayDistinct(child: Expression)` * `class PythonForeachWriter(func: PythonFunction, schema: StructType)` * ` class UnsafeRowBuffer(taskMemoryManager: TaskMemoryManager, tempDir: File, numFields: Int)` * `trait MemorySinkBase extends BaseStreamingSink with Logging ` * `class MemorySink(val schema: StructType, outputMode: OutputMode, options: DataSourceOptions)` * `class ForeachBatchSink[T](batchWriter: (Dataset[T], Long) => Unit, encoder: ExpressionEncoder[T])` * `trait PythonForeachBatchFunction ` * `case class ForeachWriterProvider[T](` * `case class ForeachWriterFactory[T](` * `class ForeachDataWriter[T](` * `class MemoryWriter(` * `class MemoryStreamWriter(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21594 LGTM except some comments about test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197311829 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala --- @@ -801,4 +800,67 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext } assert(cachedData.collect === Seq(1001)) } + + test("SPARK-24596 Non-cascading Cache Invalidation - uncache temporary view") { +withView("t1", "t2") { --- End diff -- `withTempView` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197312423 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala --- @@ -143,9 +153,57 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits df.count() df2.cache() -val plan = df2.queryExecution.withCachedData -assert(plan.isInstanceOf[InMemoryRelation]) -val internalPlan = plan.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan - assert(internalPlan.find(_.isInstanceOf[InMemoryTableScanExec]).isDefined) +assertCacheDependency(df2) + } + + test("SPARK-24596 Non-cascading Cache Invalidation") { +val df = Seq(("a", 1), ("b", 2)).toDF("s", "i") +val df2 = df.filter('i > 1) +val df3 = df.filter('i < 2) + +df2.cache() +df.cache() +df.count() +df3.cache() + +df.unpersist() + +// df un-cached; df2 and df3's cache plan re-compiled +assert(df.storageLevel == StorageLevel.NONE) +assertCacheDependency(df2, 0) +assertCacheDependency(df3, 0) + } + + test("SPARK-24596 Non-cascading Cache Invalidation - verify cached data reuse") { +val expensiveUDF = udf({ x: Int => Thread.sleep(5000); x }) +val df = spark.range(0, 10).toDF("a") +val df1 = df.withColumn("b", expensiveUDF($"a")) +val df2 = df1.groupBy('a).agg(sum('b)) +val df3 = df.agg(sum('a)) + +df1.cache() +df2.cache() +df2.collect() +df3.cache() + +assertCacheDependency(df2) + +df1.unpersist(blocking = true) + +// df1 un-cached; df2's cache plan re-compiled +assert(df1.storageLevel == StorageLevel.NONE) +assertCacheDependency(df1.groupBy('a).agg(sum('b)), 0) + +val df4 = df1.groupBy('a).agg(sum('b)).select("sum(b)") +assertCached(df4) +// reuse loaded cache +failAfter(3 seconds) { + df4.collect() +} + +val df5 = df.agg(sum('a)).filter($"sum(a)" > 1) +assertCached(df5) +// first time use, load cache +df5.collect() --- End diff -- how do we prove this takes more than 5 seconds? 
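On the timing question above: one way to demonstrate it is to measure the first (cache-building) evaluation against a later cache hit. A self-contained sketch, with a 50 ms sleep standing in for the test's 5-second UDF (illustrative only, not the Spark test itself):

```scala
// Time a block in milliseconds.
def timeMillis[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000L)
}

// Stand-in for an expensive UDF behind a cache: the first call pays the
// sleep, later calls return the memoized value immediately.
var memo: Option[Int] = None
def expensive(): Int = memo.getOrElse {
  Thread.sleep(50)
  val v = 42
  memo = Some(v)
  v
}
```

Asserting that the first call takes at least the sleep time while a repeat call stays well under it is the same proof the `failAfter(3 seconds)` bound in the test under review relies on.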
[GitHub] spark pull request #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidati...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21594#discussion_r197311907 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala --- @@ -801,4 +800,67 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext } assert(cachedData.collect === Seq(1001)) } + + test("SPARK-24596 Non-cascading Cache Invalidation - uncache temporary view") { +withView("t1", "t2") { + sql("CACHE TABLE t1 AS SELECT * FROM testData WHERE key > 1") + sql("CACHE TABLE t2 as SELECT * FROM t1 WHERE value > 1") + + assert(spark.catalog.isCached("t1")) + assert(spark.catalog.isCached("t2")) + sql("UNCACHE TABLE t1") + assert(!spark.catalog.isCached("t1")) + assert(spark.catalog.isCached("t2")) +} + } + + test("SPARK-24596 Non-cascading Cache Invalidation - drop temporary view") { +withView("t1", "t2") { --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92181/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21606 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21606: [SPARK-24552][core][SQL] Use task ID instead of attempt ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21606 **[Test build #92181 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92181/testReport)** for PR 21606 at commit [`7233a5f`](https://github.com/apache/spark/commit/7233a5fd7b154e2a1400c5fac11d0356a22f5f98). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11105: [SPARK-12469][CORE] Data Property accumulators for Spark
Github user tdyas commented on the issue: https://github.com/apache/spark/pull/11105 I was curious whether there are any active plans to complete this PR.