subject:"\[GitHub\] spark pull request #21389\: \[SPARK\-24204\]\[SQL\] Verify a schema in Json\/Orc\/Pa..."

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-27 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21389


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198355457
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
--- End diff --

Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198355370
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource in write path.
+   */
+  def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = false)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource in read path.
+   */
+  def verifyReadSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = true)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource. This verification 
should be done
+   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
+   * in `FileFormat`.
+   *
+   * Unsupported data types of csv, json, orc, and parquet are as follows;
+   *  csv -> R/W: Interval, Null, Array, Map, Struct
+   *  json -> W: Interval
+   *  orc -> W: Interval, Null
+   *  parquet -> R/W: Interval, Null
+   */
+  private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
+def throwUnsupportedException(dataType: DataType): Unit = {
+  throw new UnsupportedOperationException(
+s"$format data source does not support ${dataType.simpleString} 
data type.")
+}
+
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case _: CalendarIntervalType | _: StructType | _: ArrayType | _: 
MapType
+  if format.isInstanceOf[CSVFileFormat] =>
+throwUnsupportedException(dataType)
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
+  // for back-compatibility.
+  case _: CalendarIntervalType if isReadPath &&
--- End diff --

Thanks for the comment. Based on the suggestion, I modified code. Could you 
check again?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198351604
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
--- End diff --

No, you are right. I misunderstood.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198350214
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
+  var msg = intercept[UnsupportedOperationException] {
+Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, 
b)").write.csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
struct data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a struct")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support struct 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Map("Tesla" -> 3))).toDF("id", 
"cars").write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
map data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a map")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support map 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+ val schema = StructType.fromDDL("a array")
+ spark.range(1).write.mode("overwrite").csv(csvDir)
+ spark.read.schema(schema).csv(csvDir).collect()
+   }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 
4.25.toDF("id", "vectors")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType(StructField("a", new 
UDT.MyDenseVectorUDT(), true) :: Nil)
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type."))
+}
+  }
+
+  test("SPARK-24204 error handling for unsupported Interval data types - 
csv, json, parquet, orc") {
+withTempDir { dir =>
+  val tempDir = new File(dir, "files").getCanonicalPath
+
+   Seq("orc", "json").foreach { format =>
--- End diff --

fixed


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198350225
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
+  var msg = intercept[UnsupportedOperationException] {
+Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, 
b)").write.csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
struct data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a struct")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support struct 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Map("Tesla" -> 3))).toDF("id", 
"cars").write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
map data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a map")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support map 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+ val schema = StructType.fromDDL("a array")
+ spark.range(1).write.mode("overwrite").csv(csvDir)
+ spark.read.schema(schema).csv(csvDir).collect()
+   }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 
4.25.toDF("id", "vectors")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType(StructField("a", new 
UDT.MyDenseVectorUDT(), true) :: Nil)
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type."))
+}
+  }
+
+  test("SPARK-24204 error handling for unsupported Interval data types - 
csv, json, parquet, orc") {
+withTempDir { dir =>
+  val tempDir = new File(dir, "files").getCanonicalPath
+
+   Seq("orc", "json").foreach { format =>
+// write path
+var msg = intercept[AnalysisException] {
+  sql("select interval 1 
days").write.format(format).mode("overwrite").save(tempDir)
+}.getMessage
+assert(msg.contains("Cannot save interval data type into external 
storage."))
+
+msg = intercept[UnsupportedOperationException] {
+  spark.udf.register("testType", () => new IntervalData())
+  sql("select 
testType()").write.format(format).mode("overwrite").save(tempDir)
+}.getMessage
+assert(msg.toLowerCase(Locale.ROOT)
+  .contains(s"$format data source does not support 
calendarinterval data type."))
+
+// read path
+// We expect the types below should be passed for 
backward-compatibility
+
+// Interval type
+var schema = StructType(StructField("a", CalendarIntervalType, 
true) :: Nil)
+spark.range(1).write.format(format).mode("overwrite").save(tempDir)
+spark.read.schema(schema).format(format).load(tempDir).collect()
+
+// UDT having interval data
+schema = StructType(StructField("a", new IntervalUDT(), true) :: 
Nil)
+

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198348474
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
--- End diff --

This tests are moved from `CSVSuite`: 
https://github.com/apache/spark/pull/21389/files#diff-219ac8201e443435499123f96e94d29fL743

Do I misunderstand your comment?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198342819
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
+  var msg = intercept[UnsupportedOperationException] {
+Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, 
b)").write.csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
struct data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a struct")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support struct 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Map("Tesla" -> 3))).toDF("id", 
"cars").write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
map data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a map")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support map 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+ val schema = StructType.fromDDL("a array")
+ spark.range(1).write.mode("overwrite").csv(csvDir)
+ spark.read.schema(schema).csv(csvDir).collect()
+   }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 
4.25.toDF("id", "vectors")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType(StructField("a", new 
UDT.MyDenseVectorUDT(), true) :: Nil)
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type."))
+}
+  }
+
+  test("SPARK-24204 error handling for unsupported Interval data types - 
csv, json, parquet, orc") {
+withTempDir { dir =>
+  val tempDir = new File(dir, "files").getCanonicalPath
+
+   Seq("orc", "json").foreach { format =>
--- End diff --

Also there is a space indent here, which should be removed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198341944
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
+  var msg = intercept[UnsupportedOperationException] {
+Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, 
b)").write.csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
struct data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a struct")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support struct 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Map("Tesla" -> 3))).toDF("id", 
"cars").write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
map data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a map")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support map 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+ val schema = StructType.fromDDL("a array")
+ spark.range(1).write.mode("overwrite").csv(csvDir)
+ spark.read.schema(schema).csv(csvDir).collect()
+   }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 
4.25.toDF("id", "vectors")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType(StructField("a", new 
UDT.MyDenseVectorUDT(), true) :: Nil)
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type."))
+}
+  }
+
+  test("SPARK-24204 error handling for unsupported Interval data types - 
csv, json, parquet, orc") {
+withTempDir { dir =>
+  val tempDir = new File(dir, "files").getCanonicalPath
+
+   Seq("orc", "json").foreach { format =>
+// write path
+var msg = intercept[AnalysisException] {
+  sql("select interval 1 
days").write.format(format).mode("overwrite").save(tempDir)
+}.getMessage
+assert(msg.contains("Cannot save interval data type into external 
storage."))
+
+msg = intercept[UnsupportedOperationException] {
+  spark.udf.register("testType", () => new IntervalData())
+  sql("select 
testType()").write.format(format).mode("overwrite").save(tempDir)
+}.getMessage
+assert(msg.toLowerCase(Locale.ROOT)
+  .contains(s"$format data source does not support 
calendarinterval data type."))
+
+// read path
+// We expect the types below should be passed for 
backward-compatibility
+
+// Interval type
+var schema = StructType(StructField("a", CalendarIntervalType, 
true) :: Nil)
+spark.range(1).write.format(format).mode("overwrite").save(tempDir)
+spark.read.schema(schema).format(format).load(tempDir).collect()
+
+// UDT having interval data
+schema = StructType(StructField("a", new IntervalUDT(), true) :: 
Nil)
+

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198341922
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
+  var msg = intercept[UnsupportedOperationException] {
+Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, 
b)").write.csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
struct data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a struct")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support struct 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Map("Tesla" -> 3))).toDF("id", 
"cars").write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
map data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a map")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support map 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+ val schema = StructType.fromDDL("a array")
+ spark.range(1).write.mode("overwrite").csv(csvDir)
+ spark.read.schema(schema).csv(csvDir).collect()
+   }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 
4.25.toDF("id", "vectors")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType(StructField("a", new 
UDT.MyDenseVectorUDT(), true) :: Nil)
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type."))
+}
+  }
+
+  test("SPARK-24204 error handling for unsupported Interval data types - 
csv, json, parquet, orc") {
+withTempDir { dir =>
+  val tempDir = new File(dir, "files").getCanonicalPath
+
+   Seq("orc", "json").foreach { format =>
--- End diff --

Nit: we can put all the write path together to reduce duplicated code


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198341823
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
--- End diff --

This is duplicated with `CSVSuite`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198341248
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,230 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported Array/Map/Struct types 
- csv") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
+  var msg = intercept[UnsupportedOperationException] {
+Seq((1, "Tesla")).toDF("a", "b").selectExpr("struct(a, 
b)").write.csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
struct data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a struct")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support struct 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Map("Tesla" -> 3))).toDF("id", 
"cars").write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support 
map data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType.fromDDL("a map")
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support map 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, Array("Tesla", "Chevy", "Ford"))).toDF("id", "brands")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+ val schema = StructType.fromDDL("a array")
+ spark.range(1).write.mode("overwrite").csv(csvDir)
+ spark.read.schema(schema).csv(csvDir).collect()
+   }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 
4.25.toDF("id", "vectors")
+  .write.mode("overwrite").csv(csvDir)
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type"))
+
+  msg = intercept[UnsupportedOperationException] {
+val schema = StructType(StructField("a", new 
UDT.MyDenseVectorUDT(), true) :: Nil)
+spark.range(1).write.mode("overwrite").csv(csvDir)
+spark.read.schema(schema).csv(csvDir).collect()
+  }.getMessage
+  assert(msg.contains("CSV data source does not support array 
data type."))
+}
+  }
+
+  test("SPARK-24204 error handling for unsupported Interval data types - 
csv, json, parquet, orc") {
+withTempDir { dir =>
+  val tempDir = new File(dir, "files").getCanonicalPath
+
+   Seq("orc", "json").foreach { format =>
+// write path
--- End diff --

space indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-26 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r198340297
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource in write path.
+   */
+  def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = false)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource in read path.
+   */
+  def verifyReadSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = true)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource. This verification 
should be done
+   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
+   * in `FileFormat`.
+   *
+   * Unsupported data types of csv, json, orc, and parquet are as follows;
+   *  csv -> R/W: Interval, Null, Array, Map, Struct
+   *  json -> W: Interval
+   *  orc -> W: Interval, Null
+   *  parquet -> R/W: Interval, Null
+   */
+  private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
+def throwUnsupportedException(dataType: DataType): Unit = {
+  throw new UnsupportedOperationException(
+s"$format data source does not support ${dataType.simpleString} 
data type.")
+}
+
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case _: CalendarIntervalType | _: StructType | _: ArrayType | _: 
MapType
+  if format.isInstanceOf[CSVFileFormat] =>
+throwUnsupportedException(dataType)
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
+  // for back-compatibility.
+  case _: CalendarIntervalType if isReadPath &&
--- End diff --

If `isReadPath` is `false`, we should always throw exception.
So we can simplify all the handling of `CalendarIntervalType` as following:
```
case _: CalendarIntervalType if !isReadPath => 
throwUnsupportedException(dataType)
case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] | 
format.isInstanceOf[OrcFileFormat] => throwUnsupportedException(dataType)
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-23 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r197626497
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource in write path.
+   */
+  def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = false)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource in read path.
+   */
+  def verifyReadSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = true)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource. This verification 
should be done
+   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
+   * in `FileFormat`.
+   *
+   * Unsupported data types of csv, json, orc, and parquet are as follows;
+   *  csv -> R/W: Interval, Null, Array, Map, Struct
+   *  json -> W: Interval
+   *  orc -> W: Interval, Null
+   *  parquet -> R/W: Interval, Null
+   */
+  private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
+def throwUnsupportedException(dataType: DataType): Unit = {
+  throw new UnsupportedOperationException(
+s"$format data source does not support ${dataType.simpleString} 
data type.")
+}
+
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case _: StructType | _: ArrayType | _: MapType if 
format.isInstanceOf[CSVFileFormat] =>
+throwUnsupportedException(dataType)
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case _: CalendarIntervalType if isReadPath && 
format.isInstanceOf[JsonFileFormat] ||
+isReadPath && format.isInstanceOf[OrcFileFormat] =>
--- End diff --

yea, I'll recheck and list up these in the end.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-23 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r197626501
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,222 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported data types") {
--- End diff --

ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-23 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r197626306
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource in write path.
+   */
+  def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = false)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource in read path.
+   */
+  def verifyReadSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = true)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource. This verification 
should be done
+   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
+   * in `FileFormat`.
+   *
+   * Unsupported data types of csv, json, orc, and parquet are as follows;
+   *  csv -> R/W: Interval, Null, Array, Map, Struct
+   *  json -> W: Interval
+   *  orc -> W: Interval, Null
+   *  parquet -> R/W: Interval, Null
+   */
+  private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
+def throwUnsupportedException(dataType: DataType): Unit = {
+  throw new UnsupportedOperationException(
+s"$format data source does not support ${dataType.simpleString} 
data type.")
+}
+
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case _: StructType | _: ArrayType | _: MapType if 
format.isInstanceOf[CSVFileFormat] =>
+throwUnsupportedException(dataType)
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case _: CalendarIntervalType if isReadPath && 
format.isInstanceOf[JsonFileFormat] ||
+isReadPath && format.isInstanceOf[OrcFileFormat] =>
--- End diff --

Let us try to enumerate all the ones we want to block? Instead of relying 
on the last case to throw the exceptions?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-23 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r197626232
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -202,4 +204,222 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext with Befo
   }
 }
   }
+
+  // Unsupported data types of csv, json, orc, and parquet are as follows;
+  //  csv -> R/W: Interval, Null, Array, Map, Struct
+  //  json -> W: Interval
+  //  orc -> W: Interval, Null
+  //  parquet -> R/W: Interval, Null
+  test("SPARK-24204 error handling for unsupported data types") {
--- End diff --

Can we split this test case to multiple smaller ones? It will be easy to 
understand what happens if any of them fail by a meaningful test case name.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-23 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r197626168
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource in write path.
+   */
+  def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = false)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource in read path.
+   */
+  def verifyReadSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = true)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource. This verification 
should be done
+   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
+   * in `FileFormat`.
+   *
+   * Unsupported data types of csv, json, orc, and parquet are as follows;
+   *  csv -> R/W: Interval, Null, Array, Map, Struct
+   *  json -> W: Interval
+   *  orc -> W: Interval, Null
+   *  parquet -> R/W: Interval, Null
+   */
+  private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
+def throwUnsupportedException(dataType: DataType): Unit = {
+  throw new UnsupportedOperationException(
+s"$format data source does not support ${dataType.simpleString} 
data type.")
+}
+
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case _: StructType | _: ArrayType | _: MapType if 
format.isInstanceOf[CSVFileFormat] =>
+throwUnsupportedException(dataType)
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case _: CalendarIntervalType if isReadPath && 
format.isInstanceOf[JsonFileFormat] ||
+isReadPath && format.isInstanceOf[OrcFileFormat] =>
+
+  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
+
+  // For JSON backward-compatibility
+  case NullType if format.isInstanceOf[JsonFileFormat] ||
+(isReadPath && format.isInstanceOf[OrcFileFormat]) =>
+
+  case _ => throwUnsupportedException(dataType)
--- End diff --

ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-23 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r197626164
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource in write path.
+   */
+  def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = false)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource in read path.
+   */
+  def verifyReadSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = true)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource. This verification 
should be done
+   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
+   * in `FileFormat`.
+   *
+   * Unsupported data types of csv, json, orc, and parquet are as follows;
+   *  csv -> R/W: Interval, Null, Array, Map, Struct
+   *  json -> W: Interval
+   *  orc -> W: Interval, Null
+   *  parquet -> R/W: Interval, Null
+   */
+  private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
+def throwUnsupportedException(dataType: DataType): Unit = {
+  throw new UnsupportedOperationException(
+s"$format data source does not support ${dataType.simpleString} 
data type.")
+}
+
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case _: StructType | _: ArrayType | _: MapType if 
format.isInstanceOf[CSVFileFormat] =>
+throwUnsupportedException(dataType)
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case _: CalendarIntervalType if isReadPath && 
format.isInstanceOf[JsonFileFormat] ||
+isReadPath && format.isInstanceOf[OrcFileFormat] =>
--- End diff --

ok, will do


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-23 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r197626154
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource in write path.
+   */
+  def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = false)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource in read path.
+   */
+  def verifyReadSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = true)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource. This verification 
should be done
+   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
+   * in `FileFormat`.
+   *
+   * Unsupported data types of csv, json, orc, and parquet are as follows;
+   *  csv -> R/W: Interval, Null, Array, Map, Struct
+   *  json -> W: Interval
+   *  orc -> W: Interval, Null
+   *  parquet -> R/W: Interval, Null
+   */
+  private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
+def throwUnsupportedException(dataType: DataType): Unit = {
+  throw new UnsupportedOperationException(
+s"$format data source does not support ${dataType.simpleString} 
data type.")
+}
+
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case _: StructType | _: ArrayType | _: MapType if 
format.isInstanceOf[CSVFileFormat] =>
+throwUnsupportedException(dataType)
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case _: CalendarIntervalType if isReadPath && 
format.isInstanceOf[JsonFileFormat] ||
+isReadPath && format.isInstanceOf[OrcFileFormat] =>
+
+  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
+
+  // For JSON backward-compatibility
+  case NullType if format.isInstanceOf[JsonFileFormat] ||
+(isReadPath && format.isInstanceOf[OrcFileFormat]) =>
+
+  case _ => throwUnsupportedException(dataType)
--- End diff --

Write a comment above this? 
> // Actually we won't pass in unsupported data types, this is a safety 
check.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-23 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r197626122
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource in write path.
+   */
+  def verifyWriteSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = false)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource in read path.
+   */
+  def verifyReadSchema(format: FileFormat, schema: StructType): Unit = {
+verifySchema(format, schema, isReadPath = true)
+  }
+
+  /**
+   * Verify if the schema is supported in datasource. This verification 
should be done
+   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
+   * in `FileFormat`.
+   *
+   * Unsupported data types of csv, json, orc, and parquet are as follows;
+   *  csv -> R/W: Interval, Null, Array, Map, Struct
+   *  json -> W: Interval
+   *  orc -> W: Interval, Null
+   *  parquet -> R/W: Interval, Null
+   */
+  private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
+def throwUnsupportedException(dataType: DataType): Unit = {
+  throw new UnsupportedOperationException(
+s"$format data source does not support ${dataType.simpleString} 
data type.")
+}
+
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case _: StructType | _: ArrayType | _: MapType if 
format.isInstanceOf[CSVFileFormat] =>
+throwUnsupportedException(dataType)
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case _: CalendarIntervalType if isReadPath && 
format.isInstanceOf[JsonFileFormat] ||
+isReadPath && format.isInstanceOf[OrcFileFormat] =>
--- End diff --

simplify the condition?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-14 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r195579915
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource.
--- End diff --

ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-06-14 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r195472992
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
+import org.apache.spark.sql.execution.datasources.json.JsonFileFormat
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource.
--- End diff --

Please improve the description? and document which built-in file formats 
are covered by this function. Also document which data types are not supported 
for each data source? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-24 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r190791115
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource.
+   */
+  def verifySchema(format: String, schema: StructType): Unit = {
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
+
+  // For backward-compatibility
--- End diff --

Yes, as long as it does not break anything. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-23 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r190461717
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource.
+   */
+  def verifySchema(format: String, schema: StructType): Unit = {
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
+
+  // For backward-compatibility
+  case NullType if format == "JSON" =>
+
+  case _ =>
+throw new UnsupportedOperationException(
--- End diff --

ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-23 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r190461628
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource.
+   */
+  def verifySchema(format: String, schema: StructType): Unit = {
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
+
+  // For backward-compatibility
--- End diff --

ok, I will.
Also, we need to merge this function with `CSVUtils.verifySchema` in this 
pr?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-23 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r190461517
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource.
+   */
+  def verifySchema(format: String, schema: StructType): Unit = {
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
+
+  // For backward-compatibility
+  case NullType if format == "JSON" =>
+
+  case _ =>
+throw new UnsupportedOperationException(
--- End diff --

Basically, for such a PR, we need to check all the data types that we block 
and ensure no behavior change is introduced by this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-23 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r190461402
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.types._
+
+
+object DataSourceUtils {
+
+  /**
+   * Verify if the schema is supported in datasource.
+   */
+  def verifySchema(format: String, schema: StructType): Unit = {
+def verifyType(dataType: DataType): Unit = dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
+
+  case st: StructType => st.foreach { f => verifyType(f.dataType) }
+
+  case ArrayType(elementType, _) => verifyType(elementType)
+
+  case MapType(keyType, valueType, _) =>
+verifyType(keyType)
+verifyType(valueType)
+
+  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
+
+  // For backward-compatibility
--- End diff --

Do we have any test case for this?




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-23 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r190136535
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
 ---
@@ -89,6 +89,8 @@ class OrcFileFormat
   job: Job,
   options: Map[String, String],
   dataSchema: StructType): OutputWriterFactory = {
+DataSourceUtils.verifySchema("ORC", dataSchema)
--- End diff --

yea, that's smart. I will update.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-23 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r190130347
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
 ---
@@ -89,6 +89,8 @@ class OrcFileFormat
   job: Job,
   options: Map[String, String],
   dataSchema: StructType): OutputWriterFactory = {
+DataSourceUtils.verifySchema("ORC", dataSchema)
--- End diff --

Thank you for refactoring the PR, @maropu ! What about using `shortName` 
instead of string literal "ORC" here? Then, we can have the same line like the 
following.
```
DataSourceUtils.verifySchema(shortName, dataSchema)
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-22 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r189964737
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
   spark.read.option("mode", "PERMISSIVE").option("encoding", 
"UTF-8").json(Seq(badJson).toDS()),
   Row(badJson))
   }
+
+  test("SPARK-24204 error handling for unsupported data types") {
--- End diff --

ok, I'll brush up the tests.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-22 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r189964186
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
   spark.read.option("mode", "PERMISSIVE").option("encoding", 
"UTF-8").json(Seq(badJson).toDS()),
   Row(badJson))
   }
+
+  test("SPARK-24204 error handling for unsupported data types") {
--- End diff --

The suite doesn't assume that all file-based data source has the same 
capability. In this PR, the test codes are almost the same and the only 
difference are the mapping tables. For example,
- `json` -> Interval
- `orc` -> Interval, Null
- `parquet` -> Interval, Null


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-22 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r189961194
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
   spark.read.option("mode", "PERMISSIVE").option("encoding", 
"UTF-8").json(Seq(badJson).toDS()),
   Row(badJson))
   }
+
+  test("SPARK-24204 error handling for unsupported data types") {
--- End diff --

(Probably, the suggestion is related to the one above) Essentially, the 
supported types are specific to datasource implementations, so I'm not 100% 
sure that it's the best to put these tests in `FileBasedDataSourceSuite `.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-22 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r189952391
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
   spark.read.option("mode", "PERMISSIVE").option("encoding", 
"UTF-8").json(Seq(badJson).toDS()),
   Row(badJson))
   }
+
+  test("SPARK-24204 error handling for unsupported data types") {
--- End diff --

Thank you for pinging me, @maropu .
Since this is all about file-based data sources, can we have all these test 
cases in `FileBasedDataSourceSuite`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-22 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r189924257
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonUtils.scala
 ---
@@ -48,4 +49,33 @@ object JsonUtils {
   json.sample(withReplacement = false, options.samplingRatio, 1)
 }
   }
+
+  /**
+   * Verify if the schema is supported in JSON datasource.
+   */
+  def verifySchema(schema: StructType): Unit = {
--- End diff --

Since supported types are specific to data sources, I think we need to 
verify a schema in each file format implementations. But, yes these 
built-in format (orc and parquet) has the same supported types, so it might be 
better to move the code `veryfySchema` into somewhere (e.g., `DataSourceUtils` 
or something) for avoiding code duplication


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-22 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r189913422
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonUtils.scala
 ---
@@ -48,4 +49,33 @@ object JsonUtils {
   json.sample(withReplacement = false, options.samplingRatio, 1)
 }
   }
+
+  /**
+   * Verify if the schema is supported in JSON datasource.
+   */
+  def verifySchema(schema: StructType): Unit = {
--- End diff --

Hmm .. but wouldn't the supported types be very specific to data source?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-22 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21389#discussion_r189872259
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonUtils.scala
 ---
@@ -48,4 +49,33 @@ object JsonUtils {
   json.sample(withReplacement = false, options.samplingRatio, 1)
 }
   }
+
+  /**
+   * Verify if the schema is supported in JSON datasource.
+   */
+  def verifySchema(schema: StructType): Unit = {
--- End diff --

The function `verifySchema` is very similar with the one in Orc/Parquet 
except the exception message. Should we put it into a util object?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/Pa...

2018-05-21 Thread maropu

GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/21389

[SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat

## What changes were proposed in this pull request?
This pr added code to verify a schema in Json/Orc/ParquetFileFormat along 
with CSVFileFormat.

## How was this patch tested?
Added tests in `OrcSourceSuite`, `ParquetQuerySuite`, and `JsonSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-24204

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21389.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21389


commit 0d88bcb58f9298bed433b8febc4c9cfb5d92f6a9
Author: Takeshi Yamamuro 
Date:   2018-05-08T01:55:26Z

Fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

38 matches

Mail list logo