[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21389 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198355457 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath --- End diff -- Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198355370 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource in write path. + */ + def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = false) + } + + /** + * Verify if the schema is supported in datasource in read path. + */ + def verifyReadSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = true) + } + + /** + * Verify if the schema is supported in datasource. This verification should be done + * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues` + * in `FileFormat`. + * + * Unsupported data types of csv, json, orc, and parquet are as follows; + * csv -> R/W: Interval, Null, Array, Map, Struct + * json -> W: Interval + * orc -> W: Interval, Null + * parquet -> R/W: Interval, Null + */ + private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = { +def throwUnsupportedException(dataType: DataType): Unit = { + throw new UnsupportedOperationException( +s"$format data source does not support ${dataType.simpleString} data type.") +} + +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType + if format.isInstanceOf[CSVFileFormat] => +throwUnsupportedException(dataType) + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + // JSON and ORC don't support an Interval type, but we pass it in read pass + // for back-compatibility. + case _: CalendarIntervalType if isReadPath && --- End diff -- Thanks for the comment. Based on the suggestion, I modified code. Could you check again? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198351604 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath --- End diff -- No, you are right. I misunderstood. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198350214 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath + var msg = intercept[UnsupportedOperationException] { +Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, b)").write.csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a struct") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Map("Tesla" -> 3))).toDF("id", "cars").write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a map") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { + val schema = StructType.fromDDL("a array") + spark.range(1).write.mode("overwrite").csv(csvDir) + spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25.toDF("id", "vectors") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType(StructField("a", new UDT.MyDenseVectorUDT(), true) :: Nil) +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type.")) +} + } + + test("SPARK-24204 error handling for unsupported Interval data types - csv, json, parquet, orc") { +withTempDir { dir => + val tempDir = new File(dir, "files").getCanonicalPath + + Seq("orc", "json").foreach { format => --- End diff -- fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198350225 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath + var msg = intercept[UnsupportedOperationException] { +Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, b)").write.csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a struct") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Map("Tesla" -> 3))).toDF("id", "cars").write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a map") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { + val schema = StructType.fromDDL("a array") + spark.range(1).write.mode("overwrite").csv(csvDir) + spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25.toDF("id", "vectors") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType(StructField("a", new UDT.MyDenseVectorUDT(), true) :: Nil) +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type.")) +} + } + + test("SPARK-24204 error handling for unsupported Interval data types - csv, json, parquet, orc") { +withTempDir { dir => + val tempDir = new File(dir, "files").getCanonicalPath + + Seq("orc", "json").foreach { format => +// write path +var msg = intercept[AnalysisException] { + sql("select interval 1 days").write.format(format).mode("overwrite").save(tempDir) +}.getMessage +assert(msg.contains("Cannot save interval data type into external storage.")) + +msg = intercept[UnsupportedOperationException] { + spark.udf.register("testType", () => new IntervalData()) + sql("select testType()").write.format(format).mode("overwrite").save(tempDir) +}.getMessage +assert(msg.toLowerCase(Locale.ROOT) + .contains(s"$format data source does not support calendarinterval data type.")) + +// read path +// We expect the types below should be passed for backward-compatibility + +// Interval type +var schema = StructType(StructField("a", CalendarIntervalType, true) :: Nil) +spark.range(1).write.format(format).mode("overwrite").save(tempDir) +spark.read.schema(schema).format(format).load(tempDir).collect() + +// UDT having interval data +schema = StructType(StructField("a", new IntervalUDT(), true) :: Nil) +
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198348474 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath --- End diff -- This tests are moved from `CSVSuite`: https://github.com/apache/spark/pull/21389/files#diff-219ac8201e443435499123f96e94d29fL743 Do I misunderstand your comment? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198342819 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath + var msg = intercept[UnsupportedOperationException] { +Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, b)").write.csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a struct") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Map("Tesla" -> 3))).toDF("id", "cars").write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a map") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { + val schema = StructType.fromDDL("a array") + spark.range(1).write.mode("overwrite").csv(csvDir) + spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25.toDF("id", "vectors") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType(StructField("a", new UDT.MyDenseVectorUDT(), true) :: Nil) +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type.")) +} + } + + test("SPARK-24204 error handling for unsupported Interval data types - csv, json, parquet, orc") { +withTempDir { dir => + val tempDir = new File(dir, "files").getCanonicalPath + + Seq("orc", "json").foreach { format => --- End diff -- Also there is a space indent here, which should be removed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198341944 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath + var msg = intercept[UnsupportedOperationException] { +Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, b)").write.csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a struct") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Map("Tesla" -> 3))).toDF("id", "cars").write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a map") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { + val schema = StructType.fromDDL("a array") + spark.range(1).write.mode("overwrite").csv(csvDir) + spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25.toDF("id", "vectors") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType(StructField("a", new UDT.MyDenseVectorUDT(), true) :: Nil) +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type.")) +} + } + + test("SPARK-24204 error handling for unsupported Interval data types - csv, json, parquet, orc") { +withTempDir { dir => + val tempDir = new File(dir, "files").getCanonicalPath + + Seq("orc", "json").foreach { format => +// write path +var msg = intercept[AnalysisException] { + sql("select interval 1 days").write.format(format).mode("overwrite").save(tempDir) +}.getMessage +assert(msg.contains("Cannot save interval data type into external storage.")) + +msg = intercept[UnsupportedOperationException] { + spark.udf.register("testType", () => new IntervalData()) + sql("select testType()").write.format(format).mode("overwrite").save(tempDir) +}.getMessage +assert(msg.toLowerCase(Locale.ROOT) + .contains(s"$format data source does not support calendarinterval data type.")) + +// read path +// We expect the types below should be passed for backward-compatibility + +// Interval type +var schema = StructType(StructField("a", CalendarIntervalType, true) :: Nil) +spark.range(1).write.format(format).mode("overwrite").save(tempDir) +spark.read.schema(schema).format(format).load(tempDir).collect() + +// UDT having interval data +schema = StructType(StructField("a", new IntervalUDT(), true) :: Nil) +
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198341922 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath + var msg = intercept[UnsupportedOperationException] { +Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, b)").write.csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a struct") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Map("Tesla" -> 3))).toDF("id", "cars").write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a map") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { + val schema = StructType.fromDDL("a array") + spark.range(1).write.mode("overwrite").csv(csvDir) + spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25.toDF("id", "vectors") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType(StructField("a", new UDT.MyDenseVectorUDT(), true) :: Nil) +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type.")) +} + } + + test("SPARK-24204 error handling for unsupported Interval data types - csv, json, parquet, orc") { +withTempDir { dir => + val tempDir = new File(dir, "files").getCanonicalPath + + Seq("orc", "json").foreach { format => --- End diff -- Nit: we can put all the write path together to reduce duplicated code --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198341823 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath --- End diff -- This is duplicated with `CSVSuite`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198341248 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported Array/Map/Struct types - csv") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath + var msg = intercept[UnsupportedOperationException] { +Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, b)").write.csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a struct") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support struct data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Map("Tesla" -> 3))).toDF("id", "cars").write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType.fromDDL("a map") +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support map data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { + val schema = StructType.fromDDL("a array") + spark.range(1).write.mode("overwrite").csv(csvDir) + spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25.toDF("id", "vectors") + .write.mode("overwrite").csv(csvDir) + }.getMessage + assert(msg.contains("CSV data source does not support array data type")) + + msg = intercept[UnsupportedOperationException] { +val schema = StructType(StructField("a", new UDT.MyDenseVectorUDT(), true) :: Nil) +spark.range(1).write.mode("overwrite").csv(csvDir) +spark.read.schema(schema).csv(csvDir).collect() + }.getMessage + assert(msg.contains("CSV data source does not support array data type.")) +} + } + + test("SPARK-24204 error handling for unsupported Interval data types - csv, json, parquet, orc") { +withTempDir { dir => + val tempDir = new File(dir, "files").getCanonicalPath + + Seq("orc", "json").foreach { format => +// write path --- End diff -- space indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r198340297 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource in write path. + */ + def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = false) + } + + /** + * Verify if the schema is supported in datasource in read path. + */ + def verifyReadSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = true) + } + + /** + * Verify if the schema is supported in datasource. This verification should be done + * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues` + * in `FileFormat`. + * + * Unsupported data types of csv, json, orc, and parquet are as follows; + * csv -> R/W: Interval, Null, Array, Map, Struct + * json -> W: Interval + * orc -> W: Interval, Null + * parquet -> R/W: Interval, Null + */ + private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = { +def throwUnsupportedException(dataType: DataType): Unit = { + throw new UnsupportedOperationException( +s"$format data source does not support ${dataType.simpleString} data type.") +} + +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType + if format.isInstanceOf[CSVFileFormat] => +throwUnsupportedException(dataType) + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + // JSON and ORC don't support an Interval type, but we pass it in read pass + // for back-compatibility. + case _: CalendarIntervalType if isReadPath && --- End diff -- If `isReadPath` is `false`, we should always throw exception. So we can simplify all the handling of `CalendarIntervalType` as following: ``` case _: CalendarIntervalType if !isReadPath => throwUnsupportedException(dataType) case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] | format.isInstanceOf[OrcFileFormat] => throwUnsupportedException(dataType) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r197626497 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource in write path. + */ + def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = false) + } + + /** + * Verify if the schema is supported in datasource in read path. + */ + def verifyReadSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = true) + } + + /** + * Verify if the schema is supported in datasource. This verification should be done + * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues` + * in `FileFormat`. + * + * Unsupported data types of csv, json, orc, and parquet are as follows; + * csv -> R/W: Interval, Null, Array, Map, Struct + * json -> W: Interval + * orc -> W: Interval, Null + * parquet -> R/W: Interval, Null + */ + private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = { +def throwUnsupportedException(dataType: DataType): Unit = { + throw new UnsupportedOperationException( +s"$format data source does not support ${dataType.simpleString} data type.") +} + +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case _: StructType | _: ArrayType | _: MapType if format.isInstanceOf[CSVFileFormat] => +throwUnsupportedException(dataType) + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case _: CalendarIntervalType if isReadPath && format.isInstanceOf[JsonFileFormat] || +isReadPath && format.isInstanceOf[OrcFileFormat] => --- End diff -- yea, I'll recheck and list up these in the end. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r197626501 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,222 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported data types") { --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r197626306 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource in write path. + */ + def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = false) + } + + /** + * Verify if the schema is supported in datasource in read path. + */ + def verifyReadSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = true) + } + + /** + * Verify if the schema is supported in datasource. This verification should be done + * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues` + * in `FileFormat`. + * + * Unsupported data types of csv, json, orc, and parquet are as follows; + * csv -> R/W: Interval, Null, Array, Map, Struct + * json -> W: Interval + * orc -> W: Interval, Null + * parquet -> R/W: Interval, Null + */ + private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = { +def throwUnsupportedException(dataType: DataType): Unit = { + throw new UnsupportedOperationException( +s"$format data source does not support ${dataType.simpleString} data type.") +} + +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case _: StructType | _: ArrayType | _: MapType if format.isInstanceOf[CSVFileFormat] => +throwUnsupportedException(dataType) + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case _: CalendarIntervalType if isReadPath && format.isInstanceOf[JsonFileFormat] || +isReadPath && format.isInstanceOf[OrcFileFormat] => --- End diff -- Let us try to enumerate all the ones we want to block? Instead of relying on the last case to throw the exceptions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r197626232 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -202,4 +204,222 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + // Unsupported data types of csv, json, orc, and parquet are as follows; + // csv -> R/W: Interval, Null, Array, Map, Struct + // json -> W: Interval + // orc -> W: Interval, Null + // parquet -> R/W: Interval, Null + test("SPARK-24204 error handling for unsupported data types") { --- End diff -- Can we split this test case to multiple smaller ones? It will be easy to understand what happens if any of them fail by a meaningful test case name. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r197626168 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource in write path. + */ + def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = false) + } + + /** + * Verify if the schema is supported in datasource in read path. + */ + def verifyReadSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = true) + } + + /** + * Verify if the schema is supported in datasource. This verification should be done + * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues` + * in `FileFormat`. + * + * Unsupported data types of csv, json, orc, and parquet are as follows; + * csv -> R/W: Interval, Null, Array, Map, Struct + * json -> W: Interval + * orc -> W: Interval, Null + * parquet -> R/W: Interval, Null + */ + private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = { +def throwUnsupportedException(dataType: DataType): Unit = { + throw new UnsupportedOperationException( +s"$format data source does not support ${dataType.simpleString} data type.") +} + +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case _: StructType | _: ArrayType | _: MapType if format.isInstanceOf[CSVFileFormat] => +throwUnsupportedException(dataType) + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case _: CalendarIntervalType if isReadPath && format.isInstanceOf[JsonFileFormat] || +isReadPath && format.isInstanceOf[OrcFileFormat] => + + case udt: UserDefinedType[_] => verifyType(udt.sqlType) + + // For JSON backward-compatibility + case NullType if format.isInstanceOf[JsonFileFormat] || +(isReadPath && format.isInstanceOf[OrcFileFormat]) => + + case _ => throwUnsupportedException(dataType) --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r197626164 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource in write path. + */ + def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = false) + } + + /** + * Verify if the schema is supported in datasource in read path. + */ + def verifyReadSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = true) + } + + /** + * Verify if the schema is supported in datasource. This verification should be done + * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues` + * in `FileFormat`. + * + * Unsupported data types of csv, json, orc, and parquet are as follows; + * csv -> R/W: Interval, Null, Array, Map, Struct + * json -> W: Interval + * orc -> W: Interval, Null + * parquet -> R/W: Interval, Null + */ + private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = { +def throwUnsupportedException(dataType: DataType): Unit = { + throw new UnsupportedOperationException( +s"$format data source does not support ${dataType.simpleString} data type.") +} + +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case _: StructType | _: ArrayType | _: MapType if format.isInstanceOf[CSVFileFormat] => +throwUnsupportedException(dataType) + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case _: CalendarIntervalType if isReadPath && format.isInstanceOf[JsonFileFormat] || +isReadPath && format.isInstanceOf[OrcFileFormat] => --- End diff -- ok, will do --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r197626154 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource in write path. + */ + def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = false) + } + + /** + * Verify if the schema is supported in datasource in read path. + */ + def verifyReadSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = true) + } + + /** + * Verify if the schema is supported in datasource. This verification should be done + * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues` + * in `FileFormat`. + * + * Unsupported data types of csv, json, orc, and parquet are as follows; + * csv -> R/W: Interval, Null, Array, Map, Struct + * json -> W: Interval + * orc -> W: Interval, Null + * parquet -> R/W: Interval, Null + */ + private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = { +def throwUnsupportedException(dataType: DataType): Unit = { + throw new UnsupportedOperationException( +s"$format data source does not support ${dataType.simpleString} data type.") +} + +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case _: StructType | _: ArrayType | _: MapType if format.isInstanceOf[CSVFileFormat] => +throwUnsupportedException(dataType) + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case _: CalendarIntervalType if isReadPath && format.isInstanceOf[JsonFileFormat] || +isReadPath && format.isInstanceOf[OrcFileFormat] => + + case udt: UserDefinedType[_] => verifyType(udt.sqlType) + + // For JSON backward-compatibility + case NullType if format.isInstanceOf[JsonFileFormat] || +(isReadPath && format.isInstanceOf[OrcFileFormat]) => + + case _ => throwUnsupportedException(dataType) --- End diff -- Write a comment above this? > // Actually we won't pass in unsupported data types, this is a safety check. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r197626122 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource in write path. + */ + def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = false) + } + + /** + * Verify if the schema is supported in datasource in read path. + */ + def verifyReadSchema(format: FileFormat, schema: StructType): Unit = { +verifySchema(format, schema, isReadPath = true) + } + + /** + * Verify if the schema is supported in datasource. This verification should be done + * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues` + * in `FileFormat`. + * + * Unsupported data types of csv, json, orc, and parquet are as follows; + * csv -> R/W: Interval, Null, Array, Map, Struct + * json -> W: Interval + * orc -> W: Interval, Null + * parquet -> R/W: Interval, Null + */ + private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = { +def throwUnsupportedException(dataType: DataType): Unit = { + throw new UnsupportedOperationException( +s"$format data source does not support ${dataType.simpleString} data type.") +} + +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case _: StructType | _: ArrayType | _: MapType if format.isInstanceOf[CSVFileFormat] => +throwUnsupportedException(dataType) + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case _: CalendarIntervalType if isReadPath && format.isInstanceOf[JsonFileFormat] || +isReadPath && format.isInstanceOf[OrcFileFormat] => --- End diff -- simplify the condition? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r195579915 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource. --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r195472992 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource. --- End diff -- Please improve the description? and document which built-in file formats are covered by this function. Also document which data types are not supported for each data source? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r190791115 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource. + */ + def verifySchema(format: String, schema: StructType): Unit = { +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case udt: UserDefinedType[_] => verifyType(udt.sqlType) + + // For backward-compatibility --- End diff -- Yes, as long as it does not break anything. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r190461717 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource. + */ + def verifySchema(format: String, schema: StructType): Unit = { +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case udt: UserDefinedType[_] => verifyType(udt.sqlType) + + // For backward-compatibility + case NullType if format == "JSON" => + + case _ => +throw new UnsupportedOperationException( --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r190461628 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource. + */ + def verifySchema(format: String, schema: StructType): Unit = { +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case udt: UserDefinedType[_] => verifyType(udt.sqlType) + + // For backward-compatibility --- End diff -- ok, I will. Also, we need to merge this function with `CSVUtils.verifySchema` in this pr? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r190461517 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource. + */ + def verifySchema(format: String, schema: StructType): Unit = { +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case udt: UserDefinedType[_] => verifyType(udt.sqlType) + + // For backward-compatibility + case NullType if format == "JSON" => + + case _ => +throw new UnsupportedOperationException( --- End diff -- Basically, for such a PR, we need to check all the data types that we block and ensure no behavior change is introduced by this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r190461402 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.types._ + + +object DataSourceUtils { + + /** + * Verify if the schema is supported in datasource. + */ + def verifySchema(format: String, schema: StructType): Unit = { +def verifyType(dataType: DataType): Unit = dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType | + StringType | BinaryType | DateType | TimestampType | _: DecimalType => + + case st: StructType => st.foreach { f => verifyType(f.dataType) } + + case ArrayType(elementType, _) => verifyType(elementType) + + case MapType(keyType, valueType, _) => +verifyType(keyType) +verifyType(valueType) + + case udt: UserDefinedType[_] => verifyType(udt.sqlType) + + // For backward-compatibility --- End diff -- Do we have any test case for this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r190136535 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala --- @@ -89,6 +89,8 @@ class OrcFileFormat job: Job, options: Map[String, String], dataSchema: StructType): OutputWriterFactory = { +DataSourceUtils.verifySchema("ORC", dataSchema) --- End diff -- yea, that's smart. I will update. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r190130347 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala --- @@ -89,6 +89,8 @@ class OrcFileFormat job: Job, options: Map[String, String], dataSchema: StructType): OutputWriterFactory = { +DataSourceUtils.verifySchema("ORC", dataSchema) --- End diff -- Thank you for refactoring the PR, @maropu ! What about using `shortName` instead of string literal "ORC" here? Then, we can have the same line like the following. ``` DataSourceUtils.verifySchema(shortName, dataSchema) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r189964737 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()), Row(badJson)) } + + test("SPARK-24204 error handling for unsupported data types") { --- End diff -- ok, I'll brush up the tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r189964186 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()), Row(badJson)) } + + test("SPARK-24204 error handling for unsupported data types") { --- End diff -- The suite doesn't assume that all file-based data source has the same capability. In this PR, the test codes are almost the same and the only difference are the mapping tables. For example, - `json` -> Interval - `orc` -> Interval, Null - `parquet` -> Interval, Null --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r189961194 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()), Row(badJson)) } + + test("SPARK-24204 error handling for unsupported data types") { --- End diff -- (Probably, the suggestion is related to the one above) Essentially, the supported types are specific to datasource implementations, so I'm not 100% sure that it's the best to put these tests in `FileBasedDataSourceSuite `. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r189952391 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()), Row(badJson)) } + + test("SPARK-24204 error handling for unsupported data types") { --- End diff -- Thank you for pinging me, @maropu . Since this is all about file-based data sources, can we have all these test cases in `FileBasedDataSourceSuite`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r189924257 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonUtils.scala --- @@ -48,4 +49,33 @@ object JsonUtils { json.sample(withReplacement = false, options.samplingRatio, 1) } } + + /** + * Verify if the schema is supported in JSON datasource. + */ + def verifySchema(schema: StructType): Unit = { --- End diff -- Since supported types are specific to data sources, I think we need to verify a schema in each file format implementations. But, yes these built-in format (orc and parquet) has the same supported types, so it might be better to move the code `veryfySchema` into somewhere (e.g., `DataSourceUtils` or something) for avoiding code duplication --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r189913422 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonUtils.scala --- @@ -48,4 +49,33 @@ object JsonUtils { json.sample(withReplacement = false, options.samplingRatio, 1) } } + + /** + * Verify if the schema is supported in JSON datasource. + */ + def verifySchema(schema: StructType): Unit = { --- End diff -- Hmm .. but wouldn't the supported types be very specific to data source? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21389#discussion_r189872259 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonUtils.scala --- @@ -48,4 +49,33 @@ object JsonUtils { json.sample(withReplacement = false, options.samplingRatio, 1) } } + + /** + * Verify if the schema is supported in JSON datasource. + */ + def verifySchema(schema: StructType): Unit = { --- End diff -- The function `verifySchema` is very similar with the one in Orc/Parquet except the exception message. Should we put it into a util object? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/21389 [SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat ## What changes were proposed in this pull request? This pr added code to verify a schema in Json/Orc/ParquetFileFormat along with CSVFileFormat. ## How was this patch tested? Added tests in `OrcSourceSuite`, `ParquetQuerySuite`, and `JsonSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/maropu/spark SPARK-24204 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21389.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21389 commit 0d88bcb58f9298bed433b8febc4c9cfb5d92f6a9 Author: Takeshi YamamuroDate: 2018-05-08T01:55:26Z Fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org