[GitHub] spark issue #21818: [SPARK-24860][SQL] Support setting of partitionOverWrite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21818 **[Test build #93295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93295/testReport)** for PR 21818 at commit [`7d0752b`](https://github.com/apache/spark/commit/7d0752bc0f7e53542dec3bbc01a2f4e00e051f42). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1147/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21822 **[Test build #93307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93307/testReport)** for PR 21822 at commit [`8ccafca`](https://github.com/apache/spark/commit/8ccafcab20a70df7a625912fdf4e43be7fb87954). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21813 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syn...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21813#discussion_r203914517 --- Diff: sql/core/src/test/resources/sql-tests/inputs/grouping_set.sql --- @@ -13,5 +13,39 @@ SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c GROUPING SETS ((a)); -- SPARK-17849: grouping set throws NPE #3 SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c GROUPING SETS ((c)); +-- Group sets without explicit group by +SELECT c1, sum(c2) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) AS t (c1, c2, c3) GROUP BY GROUPING SETS (c1); +-- Group sets without group by and with grouping +SELECT c1, sum(c2), grouping(c1) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) AS t (c1, c2, c3) GROUP BY GROUPING SETS (c1); + +-- Mutiple grouping within a grouping set +SELECT c1, c2, Sum(c3), grouping__id +FROM (VALUES ('x', 'a', 10), ('y', 'b', 20) ) AS t (c1, c2, c3) +GROUP BY GROUPING SETS ( ( c1 ), ( c2 ) ) +HAVING GROUPING__ID > 1; + +-- Group sets without explicit group by +SELECT grouping(c1) FROM (VALUES ('x', 'a', 10), ('y', 'b', 20)) AS t (c1, c2, c3) GROUP BY c1,c2 GROUPING SETS (c1,c2); --- End diff -- Doesn't this have explicit group by? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93299/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21652 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93298/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21652 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...
Github user eatoncys commented on the issue: https://github.com/apache/spark/pull/21823 cc @cloud-fan @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203923037 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala --- @@ -36,4 +40,27 @@ package object avro { @scala.annotation.varargs def avro(sources: String*): DataFrame = reader.format("avro").load(sources: _*) } + + /** + * Converts a binary column of avro format into its corresponding catalyst value. The specified + * schema must match the read data, otherwise the behavior is undefined: it may fail or return + * arbitrary result. + * --- End diff -- Shall we add `@since`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21822#discussion_r203924609 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -533,7 +537,8 @@ trait CheckAnalysis extends PredicateHelper { // Simplify the predicates before validating any unsupported correlation patterns // in the plan. -BooleanSimplification(sub).foreachUp { +// TODO(rxin): Why did this need to call BooleanSimplification??? --- End diff -- @rxin From what i remember Reynold, most of this logic was housed in Analyzer before and we moved it to optimizer. In the old code we used to walk the plan after simplifying the predicates. The comment used to read "Simplify the predicates before pulling them out.". I just retained that semantics. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21821 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21821 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93309/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21533: [SPARK-24195][Core] Ignore the files with "local"...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21533 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21822 **[Test build #93310 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93310/testReport)** for PR 21822 at commit [`738e99c`](https://github.com/apache/spark/commit/738e99c615f45cfb6e5882a822dafcb8472b78ea). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21653 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r203933307 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import java.io.File + +import org.apache.spark.sql.{QueryTest, Row} +import org.apache.spark.sql.execution.FileSchemaPruningTest +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SharedSQLContext + +class ParquetSchemaPruningSuite +extends QueryTest +with ParquetTest +with FileSchemaPruningTest +with SharedSQLContext { --- End diff -- Can we just simply: ```scala override def beforeEach(): Unit = { super.beforeEach() spark.conf.set(SQLConf. NESTED_SCHEMA_PRUNING_ENABLED.key, "true") } override def afterEach(): Unit = { try { spark.conf.unset(SQLConf.NESTED_SCHEMA_PRUNING_ENABLED.key) } finally { super.afterEach() } } ``` without the complicated hierarchy? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r203934281 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala --- @@ -47,16 +47,25 @@ import org.apache.spark.sql.types._ * * Due to this reason, we no longer rely on [[ReadContext]] to pass requested schema from [[init()]] * to [[prepareForRead()]], but use a private `var` for simplicity. + * + * @param parquetMrCompatibility support reading with parquet-mr or Spark's built-in Parquet reader */ -private[parquet] class ParquetReadSupport(val convertTz: Option[TimeZone]) +private[parquet] class ParquetReadSupport(val convertTz: Option[TimeZone], +parquetMrCompatibility: Boolean) extends ReadSupport[UnsafeRow] with Logging { private var catalystRequestedSchema: StructType = _ + /** + * Construct a [[ParquetReadSupport]] with [[convertTz]] set to [[None]] and + * [[parquetMrCompatibility]] set to [[false]]. + * + * We need a zero-arg constructor for SpecificParquetRecordReaderBase. But that is only + * used in the vectorized reader, where we get the convertTz value directly, and the value here + * is ignored. Further, we set [[parquetMrCompatibility]] to [[false]] as this constructor is only + * called by the Spark reader. --- End diff -- re: https://github.com/apache/spark/pull/21320/files/cb858f202e49d69f2044681e37f982dc10676296#r199631341 actually, it doesn't looks clear to me too. What does the flag indicate? you mean normal parquet reader vs vectorized parquet reader? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21818#discussion_r203934743 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala --- @@ -91,8 +92,12 @@ case class InsertIntoHadoopFsRelationCommand( val pathExists = fs.exists(qualifiedOutputPath) -val enableDynamicOverwrite = - sparkSession.sessionState.conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC +val parameters = CaseInsensitiveMap(options) + +val enableDynamicOverwrite = parameters.get("partitionOverwriteMode") + .map(mode => PartitionOverwriteMode.withName(mode.toUpperCase)) + .getOrElse(sparkSession.sessionState.conf.partitionOverwriteMode) == + PartitionOverwriteMode.DYNAMIC --- End diff -- nit: this is too long ``` val partitionOverwriteMode = parameters.get("partitionOverwriteMode") val enableDynamicOverwrite = partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21824 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1153/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21818#discussion_r203934766 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala --- @@ -486,6 +486,25 @@ class InsertSuite extends DataSourceTest with SharedSQLContext { } } } + test("SPARK-24860: dynamic partition overwrite specified per source without catalog table") { --- End diff -- nit: add a blank file before this new test case --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21820 **[Test build #93317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93317/testReport)** for PR 21820 at commit [`725c1b7`](https://github.com/apache/spark/commit/725c1b74d13309a567698b24150d1e7e06662769). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21824 **[Test build #93316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93316/testReport)** for PR 21824 at commit [`644a30d`](https://github.com/apache/spark/commit/644a30d32e390f39f25515bb251e120e69502f27). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21818#discussion_r203934987 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala --- @@ -486,6 +486,25 @@ class InsertSuite extends DataSourceTest with SharedSQLContext { } } } + test("SPARK-24860: dynamic partition overwrite specified per source without catalog table") { +withTempPath { path => + Seq((1, 1, 1)).toDF("i", "part1", "part2") --- End diff -- can we simplify the test? ideally we only need one partition column, and write some initial data to the table. then do an overwrite with partitionOverwriteMode=dynamic, and another overwrite with partitionOverwriteMode=static. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21820 **[Test build #93317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93317/testReport)** for PR 21820 at commit [`725c1b7`](https://github.com/apache/spark/commit/725c1b74d13309a567698b24150d1e7e06662769). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21820 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21733 Now I'd like to propose changing default behavior to apply new path but keeping backward compatibility, so applied it to the patch. I'm still open on decision to apply it as advanced option as first approach, and happy to roll back when we decide on that way. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
GitHub user viirya reopened a pull request: https://github.com/apache/spark/pull/20146 [SPARK-11215][ML] Add multiple columns support to StringIndexer ## What changes were proposed in this pull request? This takes over #19621 to add multi-column support to StringIndexer. ## How was this patch tested? Added tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 SPARK-11215 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20146.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20146 commit bb990f1a6511d8ce20f4fff254dfe0ff43262a10 Author: Liang-Chi Hsieh Date: 2018-01-03T03:51:59Z Add multi-column support to StringIndexer. commit 26cc94bb335cf0ba3bcdbc2b78effd447026792c Author: Liang-Chi Hsieh Date: 2018-01-07T01:42:28Z Fix glm test. commit 540c364d2a70ecd6ee5b92fadedc5e9b85026d2c Author: Liang-Chi Hsieh Date: 2018-01-16T08:20:19Z Merge remote-tracking branch 'upstream/master' into SPARK-11215 commit 18acbbf7b70b87c75ba62be863580fe9accc23b4 Author: Liang-Chi Hsieh Date: 2018-01-24T12:03:26Z Improve test cases. commit b884fb5c0ce1e627390d08d8425721ea8e4d Author: Liang-Chi Hsieh Date: 2018-01-27T00:58:06Z Merge remote-tracking branch 'upstream/master' into SPARK-11215 commit 76ff7bf6054a687abd9fc16c8044020a5454d95f Author: Liang-Chi Hsieh Date: 2018-04-19T11:27:21Z Merge remote-tracking branch 'upstream/master' into SPARK-11215 commit 50af02eaccce7cecb7c3093d5bc14675ca860c22 Author: Liang-Chi Hsieh Date: 2018-04-19T11:30:46Z Change from 2.3 to 2.4. commit c1be2c7e28ebdfed580577a108d2f254834caed7 Author: Liang-Chi Hsieh Date: 2018-04-23T10:15:49Z Address comments. commit ed35d875414ba3cf8751a77463f61665e9c373b0 Author: Liang-Chi Hsieh Date: 2018-04-23T14:00:16Z Address comment. commit a1dcfda85243a1e2210177f2acfb78821c539b17 Author: Liang-Chi Hsieh Date: 2018-04-24T06:41:07Z Use SQL Aggregator for counting string labels. commit c1685228f7ec7f4904bca67efaee70498b9894c8 Author: Liang-Chi Hsieh Date: 2018-04-25T13:28:24Z Merge remote-tracking branch 'upstream/master' into SPARK-11215 commit a6551b02a10428d66e0dadcfcb5a8da3798ec814 Author: Liang-Chi Hsieh Date: 2018-04-26T04:13:09Z Drop NA values for both frequency and alphabet order types. commit c003bd3d6c58cf19249ff0ba9dd10140971d655c Author: Liang-Chi Hsieh Date: 2018-07-18T18:58:21Z Merge remote-tracking branch 'upstream/master' into SPARK-11215 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20146 @HyukjinKwon Thanks. Closed and re-opened now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203943047 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.avro.generic.GenericDatumReader +import org.apache.avro.io.{BinaryDecoder, DecoderFactory} + +import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema} +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode} +import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} + +case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema) + extends UnaryExpression with ExpectsInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType) + + override lazy val dataType: DataType = +SchemaConverters.toSqlType(avroType.value).dataType + + override def nullable: Boolean = true + + @transient private lazy val reader = new GenericDatumReader[Any](avroType.value) + + @transient private lazy val deserializer = new AvroDeserializer(avroType.value, dataType) + + @transient private var decoder: BinaryDecoder = _ + + @transient private var result: Any = _ + + override def nullSafeEval(input: Any): Any = { +val binary = input.asInstanceOf[Array[Byte]] +decoder = DecoderFactory.get().binaryDecoder(binary, 0, binary.length, decoder) +result = reader.read(result, decoder) +deserializer.deserialize(result) + } + + override def simpleString: String = { +s"from_avro(${child.sql}, ${dataType.simpleString})" --- End diff -- It's not used in `sql`, we can override `sql` here and use untruncated version. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21817: [SPARK-24861][SS][test] create corrected temp dir...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21817 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21818: [SPARK-24860][SQL] Support setting of partitionOverWrite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21818 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21818: [SPARK-24860][SQL] Support setting of partitionOverWrite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21818 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93295/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1148/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/21440#discussion_r203914040 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -659,6 +659,11 @@ private[spark] class BlockManager( * Get block from remote block managers as serialized bytes. */ def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = { +// TODO if we change this method to return the ManagedBuffer, then getRemoteValues +// could just use the inputStream on the temp file, rather than memory-mapping the file. +// Until then, replication can cause the process to use too much memory and get killed +// by the OS / cluster manager (not a java OOM, since its a memory-mapped file) even though +// we've read the data to disk. --- End diff -- I see. I agree with you that YARN could have some issues in calculating the exact memory usage. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syn...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21813#discussion_r203917186 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -442,17 +442,35 @@ class Analyzer( child: LogicalPlan): LogicalPlan = { val gid = AttributeReference(VirtualColumn.groupingIdName, IntegerType, false)() + // In case of ANSI-SQL compliant syntax for GROUPING SETS, groupByExprs is optional and + // can be null. In such case, we derive the groupByExprs from the user supplied values for + // grouping sets. + val finalGroupByExpressions = if (groupByExprs == Nil) { +selectedGroupByExprs.flatten.foldLeft(Seq.empty[Expression]) { (result, currentExpr) => --- End diff -- @viirya No. We should be getting an error as we don't have a group by specification. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21775: [SPARK-24812][SQL] Last Access Time in the table ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21775#discussion_r203921304 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -2250,6 +2251,22 @@ class HiveDDLSuite } } + test("desc formatted table for last access verification") { --- End diff -- let's name it `SPARK-24812: desc formatted table for last access verification` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21823 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1151/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21823 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21809: [SPARK-24851] : Map a Stage ID to it's Associated...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/21809#discussion_r203936074 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala --- @@ -94,6 +94,13 @@ private[spark] class AppStatusStore( }.toSeq } + def getJobIdsAssociatedWithStage(stageId: Int): Seq[Set[Int]] = { --- End diff -- We don't need to fetch all the stage attempts, just the first of it is enough to get all the jobIds. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93307/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21813 **[Test build #93299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93299/testReport)** for PR 21813 at commit [`7cf187d`](https://github.com/apache/spark/commit/7cf187db02a54bcfd3b44e0710d95462b273ea97). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21652 **[Test build #93298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93298/testReport)** for PR 21652 at commit [`67df340`](https://github.com/apache/spark/commit/67df340d943d38afd1ea4c12c02b417b5434970f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21775: [SPARK-24812][SQL] Last Access Time in the table descrip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21775 Seems making sense. > seems to be a limitation as of now even from hive, better we can follow the hive behavior unless the limitation has been resolved from hive. Do you maybe know related Hive side ticket or can you point me out the related codes? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21775: [SPARK-24812][SQL] Last Access Time in the table descrip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21775 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203922243 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.avro.generic.GenericDatumReader +import org.apache.avro.io.{BinaryDecoder, DecoderFactory} + +import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema} +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode} +import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} + +case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema) + extends UnaryExpression with ExpectsInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType) + + override lazy val dataType: DataType = +SchemaConverters.toSqlType(avroType.value).dataType + + override def nullable: Boolean = true + + @transient private lazy val reader = new GenericDatumReader[Any](avroType.value) + + @transient private lazy val deserializer = new AvroDeserializer(avroType.value, dataType) + + @transient private var decoder: BinaryDecoder = _ + + @transient private var result: Any = _ + + override def nullSafeEval(input: Any): Any = { +val binary = input.asInstanceOf[Array[Byte]] +decoder = DecoderFactory.get().binaryDecoder(binary, 0, binary.length, decoder) +result = reader.read(result, decoder) +deserializer.deserialize(result) + } + + override def simpleString: String = { +s"from_avro(${child.sql}, ${dataType.simpleString})" --- End diff -- Shall we use catalogString for datatype? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21746: [SPARK-24699] [SS]Make watermarks work with Trigger.Once...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21746 **[Test build #93300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93300/testReport)** for PR 21746 at commit [`584c96e`](https://github.com/apache/spark/commit/584c96e5e8ecbeeb4ae4eefe1184606173b83ade). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21746: [SPARK-24699] [SS]Make watermarks work with Trigger.Once...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21746 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21746: [SPARK-24699] [SS]Make watermarks work with Trigger.Once...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21746 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93300/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21746: [SPARK-24699] [SS]Make watermarks work with Trigger.Once...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21746 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21820: [SPARK-24868][PYTHON]add sequence function in Pyt...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21820#discussion_r203934505 --- Diff: python/pyspark/sql/functions.py --- @@ -2551,6 +2551,27 @@ def map_concat(*cols): return Column(jc) +@since(2.4) +def sequence(start, stop, step=None): +""" +Generate a sequence of integers from start to stop, incrementing by step. +If step is not set, incrementing by 1 if start is less than or equal to stop, otherwise -1. + +>>> df1 = spark.createDataFrame([(-2, 2)], ('C1', 'C2')) +>>> df1.select(sequence('C1', 'C2').alias('r')).collect() +[Row(r=[-2, -1, 0, 1, 2])] +>>> df2 = spark.createDataFrame([(4, -4, -2)], ('C1', 'C2', 'C3')) --- End diff -- It will throw ```java.lang.IllegalArgumentException```. There is a check in ```collectionOperations.scala``` ``` require( (step > num.zero && start <= stop) || (step < num.zero && start >= stop) || (step == num.zero && start == stop), s"Illegal sequence boundaries: $start to $stop by $step") ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat ...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/21824 [SPARK-24871][SQL] Refactor Concat and MapConcat to avoid creating concatenator object for each row. ## What changes were proposed in this pull request? Refactor `Concat` and `MapConcat` to: - avoid creating concatenator object for each row. - make `Concat` handle `containsNull` properly. - make `Concat` shortcut if `null` child is found. ## How was this patch tested? Added some tests and existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-24871/refactor_concat_mapconcat Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21824.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21824 commit c4d4b28756d93ecc5c1e9a50e705e2a9cb4b8460 Author: Takuya UESHIN Date: 2018-07-19T08:32:15Z Refactor `Concat` and `MapConcat` to avoid creating concatenator for each row. commit 644a30d32e390f39f25515bb251e120e69502f27 Author: Takuya UESHIN Date: 2018-07-19T10:38:20Z Make `Concat` shortcut if null child is found. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21824 also cc @cloud-fan @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21802 cc @cloud-fan @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21821 **[Test build #93319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93319/testReport)** for PR 21821 at commit [`9edc28f`](https://github.com/apache/spark/commit/9edc28fdcb7261f01db716f65e723668a493327e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203938936 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala --- @@ -36,4 +40,27 @@ package object avro { @scala.annotation.varargs def avro(sources: String*): DataFrame = reader.format("avro").load(sources: _*) } + + /** + * Converts a binary column of avro format into its corresponding catalyst value. The specified + * schema must match the read data, otherwise the behavior is undefined: it may fail or return + * arbitrary result. + * + * @param data the binary column. + * @param avroType the avro type. + */ + @Experimental + def from_avro(data: Column, avroType: Schema): Column = { --- End diff -- `Column` can support arbitrary expressions. Personally I don't like these string-based overloads... Since it's in an external module, I think we can only add python/R/SQL version when we have a persistent UDF API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21820 **[Test build #93303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93303/testReport)** for PR 21820 at commit [`a7c434c`](https://github.com/apache/spark/commit/a7c434c49248703e182ae9904d344246b13f7b87). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syn...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21813#discussion_r203917039 --- Diff: sql/core/src/test/resources/sql-tests/inputs/grouping_set.sql --- @@ -13,5 +13,39 @@ SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c GROUPING SETS ((a)); -- SPARK-17849: grouping set throws NPE #3 SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c GROUPING SETS ((c)); +-- Group sets without explicit group by +SELECT c1, sum(c2) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) AS t (c1, c2, c3) GROUP BY GROUPING SETS (c1); +-- Group sets without group by and with grouping +SELECT c1, sum(c2), grouping(c1) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) AS t (c1, c2, c3) GROUP BY GROUPING SETS (c1); + +-- Mutiple grouping within a grouping set +SELECT c1, c2, Sum(c3), grouping__id +FROM (VALUES ('x', 'a', 10), ('y', 'b', 20) ) AS t (c1, c2, c3) +GROUP BY GROUPING SETS ( ( c1 ), ( c2 ) ) +HAVING GROUPING__ID > 1; + +-- Group sets without explicit group by +SELECT grouping(c1) FROM (VALUES ('x', 'a', 10), ('y', 'b', 20)) AS t (c1, c2, c3) GROUP BY c1,c2 GROUPING SETS (c1,c2); + +-- Mutiple grouping within a grouping set +SELECT -c1 AS c1 FROM (values (1,2), (3,2)) t(c1, c2) GROUP BY GROUPING SETS ((c1), (c1, c2)); + +-- complex expression in grouping sets +SELECT a + b, b, sum(c) FROM (VALUES (1,1,1),(2,2,2)) AS t(a,b,c) GROUP BY GROUPING SETS ( (a + b), (b)); + +-- complex expression in grouping sets +SELECT a + b, b, sum(c) FROM (VALUES (1,1,1),(2,2,2)) AS t(a,b,c) GROUP BY GROUPING SETS ( (a + b), (b + a), (b)); + +-- more query constructs with grouping sets +SELECT c1 AS col1, c2 AS col2 +FROM (VALUES (1, 2), (3, 2)) t(c1, c2) +GROUP BY GROUPING SETS ( ( c1 ), ( c1, c2 ) ) +HAVING col2 IS NOT NULL +ORDER BY -col1; --- End diff -- @viirya Sorry Simon.. do i have to do something for this comment ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21813 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1152/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93310/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r203934633 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala --- @@ -288,6 +310,27 @@ private[parquet] object ParquetReadSupport { } } + /** + * Computes the structural intersection between two Parquet group types. + */ + private def intersectParquetGroups( + groupType1: GroupType, groupType2: GroupType): Option[GroupType] = { +val fields = + groupType1.getFields.asScala +.filter(field => groupType2.containsField(field.getName)) +.flatMap { + case field1: GroupType => --- End diff -- `field1` -> `field`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21824 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21824 cc @mn-mikke @bersprockets --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203938957 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala --- @@ -36,4 +40,27 @@ package object avro { @scala.annotation.varargs def avro(sources: String*): DataFrame = reader.format("avro").load(sources: _*) } + + /** + * Converts a binary column of avro format into its corresponding catalyst value. The specified + * schema must match the read data, otherwise the behavior is undefined: it may fail or return + * arbitrary result. + * --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203938964 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.avro.generic.GenericDatumReader +import org.apache.avro.io.{BinaryDecoder, DecoderFactory} + +import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema} +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode} +import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} + +case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema) + extends UnaryExpression with ExpectsInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType) + + override lazy val dataType: DataType = +SchemaConverters.toSqlType(avroType.value).dataType + + override def nullable: Boolean = true + + @transient private lazy val reader = new GenericDatumReader[Any](avroType.value) + + @transient private lazy val deserializer = new AvroDeserializer(avroType.value, dataType) + + @transient private var decoder: BinaryDecoder = _ + + @transient private var result: Any = _ + + override def nullSafeEval(input: Any): Any = { +val binary = input.asInstanceOf[Array[Byte]] +decoder = DecoderFactory.get().binaryDecoder(binary, 0, binary.length, decoder) +result = reader.read(result, decoder) +deserializer.deserialize(result) + } + + override def simpleString: String = { +s"from_avro(${child.sql}, ${dataType.simpleString})" --- End diff -- but this is being used in `sql` though. Do we prefer truncated string form in `sql` too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21822 **[Test build #93320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93320/testReport)** for PR 21822 at commit [`83ffa51`](https://github.com/apache/spark/commit/83ffa51f4b165152dea214be4d73dd518d742a56). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFr...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/21821 [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWriter ## What changes were proposed in this pull request? ```Scala val udf1 = udf({(x: Int, y: Int) => x + y}) val df = spark.range(0, 3).toDF("a") .withColumn("b", udf1($"a", udf1($"a", lit(10 df.cache() df.write.saveAsTable("t") ``` Cache is not being used because the plans do not match with the cached plan. This is a regression caused by the changes we made in AnalysisBarrier, since not all the Analyzer rules are idempotent. ## How was this patch tested? Added a test. Also found a bug in the DSV1 write path. This is not a regression. Thus, opened a separate JIRA https://issues.apache.org/jira/browse/SPARK-24869 You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark testMaster22 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21821.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21821 commit 23ec09fc3bbedd2f34c594daf461cebd9c0295a6 Author: Xiao Li Date: 2018-07-19T23:38:44Z fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21820 **[Test build #93303 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93303/testReport)** for PR 21820 at commit [`a7c434c`](https://github.com/apache/spark/commit/a7c434c49248703e182ae9904d344246b13f7b87). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21822 **[Test build #93306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93306/testReport)** for PR 21822 at commit [`f6f2bcc`](https://github.com/apache/spark/commit/f6f2bccd6d887b50b035c492c88e9e76d1ef4754). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21820 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93303/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93308/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r203913178 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -3349,20 +3385,20 @@ class Dataset[T] private[sql]( } } - /** Convert to an RDD of ArrowPayload byte arrays */ - private[sql] def toArrowPayload(plan: SparkPlan): RDD[ArrowPayload] = { + /** Convert to an RDD of serialized ArrowRecordBatches. */ + private[sql] def getArrowBatchRdd(plan: SparkPlan): RDD[Array[Byte]] = { --- End diff -- Yeah, I can't remember why I changed it.. but I think you're right it so I'll change it back. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21823: [SPARK-24870][SQL]Cache can't work normally if th...
GitHub user eatoncys opened a pull request: https://github.com/apache/spark/pull/21823 [SPARK-24870][SQL]Cache can't work normally if there are case letters in SQL ## What changes were proposed in this pull request? Modified the canonicalized to not case-insensitive. Before the PR, cache can't work normally if there are case letters in SQL, for example: sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive") sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " + "from src group by key").cache().createOrReplaceTempView("src_cache") sql( s"""select a.key from (select key from src_cache where positiveNum = 1)a left join (select key from src_cache )b on a.key=b.key """).explain The physical plan of the sql is: ![image](https://user-images.githubusercontent.com/26834091/42979518-3decf0fa-8c05-11e8-9837-d5e4c334cb1f.png) The subquery "select key from src_cache where positiveNum = 1" on the left of join can use the cache data, but the subquery "select key from src_cache" on the right of join cannot use the cache data. ## How was this patch tested? new added test You can merge this pull request into a Git repository by running: $ git pull https://github.com/eatoncys/spark canonicalized Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21823.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21823 commit 2b2a5a33ed58ce07fd2515eb01e80acbedeb8b2a Author: 10129659 Date: 2018-07-20T01:43:53Z Cache can't work normally if there are case letters in SQL --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21533: [SPARK-24195][Core] Ignore the files with "local" scheme...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21533 SGTM too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21821 **[Test build #93305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93305/testReport)** for PR 21821 at commit [`23ec09f`](https://github.com/apache/spark/commit/23ec09fc3bbedd2f34c594daf461cebd9c0295a6). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21775: [SPARK-24812][SQL] Last Access Time in the table descrip...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21775 **[Test build #93311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93311/testReport)** for PR 21775 at commit [`b527fdc`](https://github.com/apache/spark/commit/b527fdc5919296ffa12e1be54367b9132ecee61e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21775: [SPARK-24812][SQL] Last Access Time in the table ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21775#discussion_r203920736 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -2250,6 +2251,22 @@ class HiveDDLSuite } } + test("desc formatted table for last access verification") { +withTable("t1") { + sql(s"create table" + +s" if not exists t1 (c1_int int, c2_string string, c3_float float)") --- End diff -- nit: ```scala sql( "CREATE TABLE IF NOT EXISTS t1 (c1_int INT, c2_string STRING, c3_float FLOAT)") ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21767: SPARK-24804 There are duplicate words in the test title ...
Github user httfighter commented on the issue: https://github.com/apache/spark/pull/21767 Thank you for your comments, @srowen @HyukjinKwon @wangyum. I will try to contribute more valuable issues! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21820: [SPARK-24868][PYTHON]add sequence function in Pyt...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21820#discussion_r203924999 --- Diff: python/pyspark/sql/functions.py --- @@ -2551,6 +2551,27 @@ def map_concat(*cols): return Column(jc) +@since(2.4) +def sequence(start, stop, step=None): +""" +Generate a sequence of integers from start to stop, incrementing by step. +If step is not set, incrementing by 1 if start is less than or equal to stop, otherwise -1. + +>>> df1 = spark.createDataFrame([(-2, 2)], ('C1', 'C2')) +>>> df1.select(sequence('C1', 'C2').alias('r')).collect() +[Row(r=[-2, -1, 0, 1, 2])] +>>> df2 = spark.createDataFrame([(4, -4, -2)], ('C1', 'C2', 'C3')) --- End diff -- What happens if the `step` is positive in this case? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93304/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21653 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93302/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21733 **[Test build #93315 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93315/testReport)** for PR 21733 at commit [`977428c`](https://github.com/apache/spark/commit/977428cb35a6fc0a9fa7a0ca1a51e39a94447a01). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r203933423 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala --- @@ -0,0 +1,387 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.exceptions.TestFailedException + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.dsl.plans._ +import org.apache.spark.sql.catalyst.expressions.NamedExpression +import org.apache.spark.sql.catalyst.parser.CatalystSqlParser +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types._ + +// scalastyle:off line.size.limit +class SelectedFieldSuite extends SparkFunSuite with BeforeAndAfterAll { + // The test schema as a tree string, i.e. `schema.treeString` + // root + // |-- col1: string (nullable = false) + // |-- col2: struct (nullable = true) + // ||-- field1: integer (nullable = true) + // ||-- field2: array (nullable = true) + // |||-- element: integer (containsNull = false) + // ||-- field3: array (nullable = false) + // |||-- element: struct (containsNull = true) + // ||||-- subfield1: integer (nullable = true) + // ||||-- subfield2: integer (nullable = true) + // ||||-- subfield3: array (nullable = true) + // |||||-- element: integer (containsNull = true) + // ||-- field4: map (nullable = true) + // |||-- key: string + // |||-- value: struct (valueContainsNull = false) + // ||||-- subfield1: integer (nullable = true) + // ||||-- subfield2: array (nullable = true) + // |||||-- element: integer (containsNull = false) + // ||-- field5: array (nullable = false) + // |||-- element: struct (containsNull = true) + // ||||-- subfield1: struct (nullable = false) + // |||||-- subsubfield1: integer (nullable = true) + // |||||-- subsubfield2: integer (nullable = true) + // ||||-- subfield2: struct (nullable = true) + // |||||-- subsubfield1: struct (nullable = true) + // ||||||-- subsubsubfield1: string (nullable = true) + // |||||-- subsubfield2: integer (nullable = true) + // ||-- field6: struct (nullable = true) + // |||-- subfield1: string (nullable = false) + // |||-- subfield2: string (nullable = true) + // ||-- field7: struct (nullable = true) + // |||-- subfield1: struct (nullable = true) + // ||||-- subsubfield1: integer (nullable = true) + // ||||-- subsubfield2: integer (nullable = true) + // ||-- field8: map (nullable = true) + // |||-- key: string + // |||-- value: array (valueContainsNull = false) + // ||||-- element: struct (containsNull = true) + // |||||-- subfield1: integer (nullable = true) + // |||||-- subfield2: array (nullable = true) + // ||||||-- element: integer (containsNull = false) + // ||-- field9: map (nullable = true) + // |||-- key: string + // |||-- value: integer (valueContainsNull = false) + // |-- col3: array (nullable = false) + // ||-- element: struct (containsNull = false) + // |||-- field1: struct (nullable = true) + // ||||-- subfield1: integer (nullable = false) + // ||||-- subfield2: integer (nullable = true) + // |||-- field2: map (nullable = true) + // ||||-- key: string + // ||||-- value: integer
[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21823 do you know why it's happening? It's super weird that `select key from src_cache where positiveNum = 1` can hit the cache but `select key from src_cache` can not. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203938730 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.avro.generic.GenericDatumReader +import org.apache.avro.io.{BinaryDecoder, DecoderFactory} + +import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema} +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode} +import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} + +case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema) + extends UnaryExpression with ExpectsInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType) + + override lazy val dataType: DataType = +SchemaConverters.toSqlType(avroType.value).dataType + + override def nullable: Boolean = true + + @transient private lazy val reader = new GenericDatumReader[Any](avroType.value) + + @transient private lazy val deserializer = new AvroDeserializer(avroType.value, dataType) + + @transient private var decoder: BinaryDecoder = _ + + @transient private var result: Any = _ + + override def nullSafeEval(input: Any): Any = { +val binary = input.asInstanceOf[Array[Byte]] +decoder = DecoderFactory.get().binaryDecoder(binary, 0, binary.length, decoder) +result = reader.read(result, decoder) +deserializer.deserialize(result) + } + + override def simpleString: String = { +s"from_avro(${child.sql}, ${dataType.simpleString})" --- End diff -- IIRC `simpleString` will be used in the plan string and should not be too long. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21821 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21821#discussion_r203905148 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -254,7 +254,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { val writer = ws.createWriter(jobId, df.logicalPlan.schema, mode, options) if (writer.isPresent) { runCommand(df.sparkSession, "save") { - WriteToDataSourceV2(writer.get(), df.logicalPlan) + WriteToDataSourceV2(writer.get(), df.planWithBarrier) --- End diff -- This change is not needed but it is safe to have. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21821 **[Test build #93305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93305/testReport)** for PR 21821 at commit [`23ec09f`](https://github.com/apache/spark/commit/23ec09fc3bbedd2f34c594daf461cebd9c0295a6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21820 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93306/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21822 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/21546 >@BryanCutler, this takes longer then I thought. Will complete my review till this weekend. For clarification, still no objection about merging it in orthogonally with my review. No problem, thanks @HyukjinKwon ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21821 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21821 **[Test build #93309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93309/testReport)** for PR 21821 at commit [`4030e17`](https://github.com/apache/spark/commit/4030e174e6efeeae6158f13675234ed994276371). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21821 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1149/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21821 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org