[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-28 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/21667

[SPARK-24691][SQL]Add new API `supportDataType` in FileFormat

## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/21389, the data source schema is validated before the actual read/write. However,

1. Putting all the validations together in `DataSourceUtils` is tricky and hard to maintain. On second thought after the review, I find that the `OrcFileFormat` in the hive package is not matched, so its validation is wrong.
2. `DataSourceUtils.verifyWriteSchema` and `DataSourceUtils.verifyReadSchema` are not supposed to be called in every file format. We can move them to some upper entry point.

So I propose adding a new API `supportDataType` in `FileFormat`. Each file format can override the method to specify its supported/unsupported data types, roughly as sketched below.
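A minimal, pared-down sketch of the idea (the exact signature and default may differ in the final patch; this only illustrates the permissive default described above):

```scala
import org.apache.spark.sql.types.{CalendarIntervalType, DataType}

trait FileFormat {
  /**
   * Returns whether this format supports the given data type in the read/write path.
   * By default every type is accepted, except CalendarIntervalType in the write path.
   */
  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    case _: CalendarIntervalType => isReadPath
    case _ => true
  }
}
```

Each concrete format (CSV, JSON, ORC, Parquet, ...) can then override the method with its own rules.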

## How was this patch tested?

Unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark refactorSchemaValidate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21667.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21667


commit 7fdf6033b6778d06850e6ae5a0fd6e3fde76a5c2
Author: Gengliang Wang 
Date:   2018-06-28T16:32:44Z

refactor schema validation




---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-29 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199070174
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
 sql("select testType()").write.mode("overwrite").orc(orcDir)
   }.getMessage
   assert(msg.contains("ORC data source does not support calendarinterval data type."))
-
-  // read path
--- End diff --

In the read path, ORC should support `CalendarIntervalType` and `NullType`.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199075466
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

Hm, wouldn't it be better to whitelist them rather than blacklist them?


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199076368
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
 sql("select testType()").write.mode("overwrite").orc(orcDir)
   }.getMessage
   assert(msg.contains("ORC data source does not support calendarinterval data type."))
-
-  // read path
--- End diff --

Is there any write path test already?


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-29 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199079744
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

A blacklist is easier.
With a whitelist, we would have to validate
```
BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
  StringType | BinaryType | DateType | TimestampType | DecimalType
```
Of course we can have a default function to process these (see the sketch below). But if we add a new data source that doesn't support all of them, the implementation will be verbose.
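For illustration, such a shared default could look like this (a sketch; the helper name is hypothetical and not part of this patch):

```scala
import org.apache.spark.sql.types._

object DataTypeWhitelist {
  // Hypothetical helper: formats that support all of Spark's atomic types could
  // delegate to this instead of re-listing them in every supportDataType override.
  def supportsAtomicType(dataType: DataType): Boolean = dataType match {
    case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType |
         DoubleType | StringType | BinaryType | DateType | TimestampType |
         _: DecimalType => true
    case _ => false
  }
}
```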


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-29 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199081671
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
 sql("select testType()").write.mode("overwrite").orc(orcDir)
   }.getMessage
   assert(msg.contains("ORC data source does not support calendarinterval data type."))
-
-  // read path
--- End diff --

@HyukjinKwon No, the unit test is about unsupported data types, and ORC supports all data types in the read path.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199094171
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
 sql("select testType()").write.mode("overwrite").orc(orcDir)
   }.getMessage
   assert(msg.contains("ORC data source does not support calendarinterval data type."))
-
-  // read path
--- End diff --

I mean the tests were negative tests, so I was expecting that we'd have positive tests too.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199095666
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

It might be easier to write, but it doesn't account for the case where we happen to add more types later. It would be better to be explicit about what we support.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-06-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199096057
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

I wrote CSV's validation with a whitelist a long time ago, per @hvanhovell's comment. I was (and still am) okay either way, but it might be good to cc him.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199416587
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -306,6 +306,7 @@ case class FileSourceScanExec(
   }
 
   private lazy val inputRDD: RDD[InternalRow] = {
+DataSourceUtils.verifyReadSchema(relation.fileFormat, relation.dataSchema)
--- End diff --

For some formats, this verification is applied twice, right?


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199418036
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

I like the whitelist, too. As @HyukjinKwon said, if someone implements a new type, the blacklist would let it pass through...


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199418986
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser, JSON
 import org.apache.spark.sql.catalyst.util.CompressionCodecs
 import org.apache.spark.sql.execution.datasources._
 import org.apache.spark.sql.sources._
-import org.apache.spark.sql.types.{StringType, StructType}
+import org.apache.spark.sql.types._
--- End diff --

If we employ the blacklist, I think it'd be better not to fold these imports.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199494578
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

A whitelist for all file formats is a behavior change. There are external file sources like https://github.com/databricks/spark-avro whose code we would probably have to update to stay compatible.

Currently, exceptions are thrown in `buildReader` / `buildReaderWithPartitionValues` / `prepareWrite` for unsupported types.

So overall I prefer the blacklist.
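To illustrate the compatibility concern (this is not code from spark-avro, just a sketch): with a whitelist default of `case _ => false`, every external `FileFormat` would need an explicit opt-in override to keep existing pipelines working, e.g.

```scala
// With a blacklist default, external formats need no change; with a whitelist
// default, each one would have to declare what it supports, at minimum:
override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
```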


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199511369
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

I meant the `case`s in the match in each implementation within Spark. I didn't mean the semantics of the API itself.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199515518
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

Also, do we really need this API? All it does is check the type and throw an exception.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199515847
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -42,63 +38,27 @@ object DataSourceUtils {
 
   /**
* Verify if the schema is supported in datasource. This verification 
should be done
-   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
-   * in `FileFormat`.
-   *
-   * Unsupported data types of csv, json, orc, and parquet are as follows;
-   *  csv -> R/W: Interval, Null, Array, Map, Struct
-   *  json -> W: Interval
-   *  orc -> W: Interval, Null
-   *  parquet -> R/W: Interval, Null
+   * in a driver side.
*/
   private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
-def throwUnsupportedException(dataType: DataType): Unit = {
-  throw new UnsupportedOperationException(
-s"$format data source does not support ${dataType.simpleString} 
data type.")
-}
-
-def verifyType(dataType: DataType): Unit = dataType match {
-  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
-   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
-
-  // All the unsupported types for CSV
-  case _: NullType | _: CalendarIntervalType | _: StructType | _: 
ArrayType | _: MapType
-  if format.isInstanceOf[CSVFileFormat] =>
-throwUnsupportedException(dataType)
-
-  case st: StructType => st.foreach { f => verifyType(f.dataType) }
-
-  case ArrayType(elementType, _) => verifyType(elementType)
-
-  case MapType(keyType, valueType, _) =>
-verifyType(keyType)
-verifyType(valueType)
-
-  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
-
-  // Interval type not supported in all the write path
-  case _: CalendarIntervalType if !isReadPath =>
-throwUnsupportedException(dataType)
-
-  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
-  // for back-compatibility.
-  case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] 
||
-format.isInstanceOf[OrcFileFormat] =>
+def verifyType(dataType: DataType): Unit = {
+  if (!format.supportDataType(dataType, isReadPath)) {
+throw new UnsupportedOperationException(
+  s"$format data source does not support ${dataType.simpleString} 
data type.")
+  }
+  dataType match {
--- End diff --

Wait .. why do we do the recursion here? What if the top-level type is supported but a nested one is not? For example, Arrow doesn't currently support nested timestamp conversion because of a localization issue.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199519380
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

@HyukjinKwon Sorry, I don't understand. Do you mean the default case is not supported?
```scala
case _ => false
```
But then how do we make all the external formats work?


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199523308
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

OK. I meant leaving the default case `true`:

```scala
def supportDataType(...): Boolean = dataType match {
  case _ => true
}
```

and whitelisting each type within each implementation, for example in `CSVFileFormat.scala`:

```scala
def supportDataType(...) ...
case _: StringType | ... => true
case _ => false
```
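Spelled out for CSV, such a whitelist might look like the following. This is only a sketch that mirrors the current checks in `DataSourceUtils` (CSV rejects null, interval, and nested types) and assumes `org.apache.spark.sql.types._` is imported in `CSVFileFormat.scala`:

```scala
override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean =
  dataType match {
    // Atomic types CSV can read and write.
    case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType |
         DoubleType | StringType | BinaryType | DateType | TimestampType |
         _: DecimalType => true
    // A UDT is supported if its underlying SQL type is.
    case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
    // Everything else (null, interval, array, map, struct) is rejected.
    case _ => false
  }
```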


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199524768
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -42,63 +38,27 @@ object DataSourceUtils {
 
   /**
* Verify if the schema is supported in datasource. This verification 
should be done
-   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
-   * in `FileFormat`.
-   *
-   * Unsupported data types of csv, json, orc, and parquet are as follows;
-   *  csv -> R/W: Interval, Null, Array, Map, Struct
-   *  json -> W: Interval
-   *  orc -> W: Interval, Null
-   *  parquet -> R/W: Interval, Null
+   * in a driver side.
*/
   private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
-def throwUnsupportedException(dataType: DataType): Unit = {
-  throw new UnsupportedOperationException(
-s"$format data source does not support ${dataType.simpleString} 
data type.")
-}
-
-def verifyType(dataType: DataType): Unit = dataType match {
-  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
-   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
-
-  // All the unsupported types for CSV
-  case _: NullType | _: CalendarIntervalType | _: StructType | _: 
ArrayType | _: MapType
-  if format.isInstanceOf[CSVFileFormat] =>
-throwUnsupportedException(dataType)
-
-  case st: StructType => st.foreach { f => verifyType(f.dataType) }
-
-  case ArrayType(elementType, _) => verifyType(elementType)
-
-  case MapType(keyType, valueType, _) =>
-verifyType(keyType)
-verifyType(valueType)
-
-  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
-
-  // Interval type not supported in all the write path
-  case _: CalendarIntervalType if !isReadPath =>
-throwUnsupportedException(dataType)
-
-  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
-  // for back-compatibility.
-  case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] 
||
-format.isInstanceOf[OrcFileFormat] =>
+def verifyType(dataType: DataType): Unit = {
+  if (!format.supportDataType(dataType, isReadPath)) {
+throw new UnsupportedOperationException(
+  s"$format data source does not support ${dataType.simpleString} 
data type.")
+  }
+  dataType match {
--- End diff --

This is for general purposes, so that developers can skip matching arrays/maps/structs themselves.
I don't know about the nested timestamp case, but we can override `supportDataType` to mark that case as unsupported, right?
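For example (purely hypothetical, since none of the built-in formats has this limitation), a format that supports `TimestampType` at the top level but not inside a container could override it like this, while the caller still recurses into element types:

```scala
// Hypothetical override: reject timestamps nested in an array. Other containers
// and deeper nesting would need similar cases.
override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean =
  dataType match {
    case ArrayType(TimestampType, _) => false
    case _ => true
  }
```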


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199525496
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -42,63 +38,27 @@ object DataSourceUtils {
 
   /**
* Verify if the schema is supported in datasource. This verification 
should be done
-   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
-   * in `FileFormat`.
-   *
-   * Unsupported data types of csv, json, orc, and parquet are as follows;
-   *  csv -> R/W: Interval, Null, Array, Map, Struct
-   *  json -> W: Interval
-   *  orc -> W: Interval, Null
-   *  parquet -> R/W: Interval, Null
+   * in a driver side.
*/
   private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
-def throwUnsupportedException(dataType: DataType): Unit = {
-  throw new UnsupportedOperationException(
-s"$format data source does not support ${dataType.simpleString} 
data type.")
-}
-
-def verifyType(dataType: DataType): Unit = dataType match {
-  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
-   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
-
-  // All the unsupported types for CSV
-  case _: NullType | _: CalendarIntervalType | _: StructType | _: 
ArrayType | _: MapType
-  if format.isInstanceOf[CSVFileFormat] =>
-throwUnsupportedException(dataType)
-
-  case st: StructType => st.foreach { f => verifyType(f.dataType) }
-
-  case ArrayType(elementType, _) => verifyType(elementType)
-
-  case MapType(keyType, valueType, _) =>
-verifyType(keyType)
-verifyType(valueType)
-
-  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
-
-  // Interval type not supported in all the write path
-  case _: CalendarIntervalType if !isReadPath =>
-throwUnsupportedException(dataType)
-
-  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
-  // for back-compatibility.
-  case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] 
||
-format.isInstanceOf[OrcFileFormat] =>
+def verifyType(dataType: DataType): Unit = {
+  if (!format.supportDataType(dataType, isReadPath)) {
+throw new UnsupportedOperationException(
+  s"$format data source does not support ${dataType.simpleString} 
data type.")
+  }
+  dataType match {
--- End diff --

Yea .. then this code bit is not really general-purpose anymore ... developers would have to read the code inside to see whether nested types are automatically checked or not ..


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199526695
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
--- End diff --

FYI, `CalendarIntervalType` isn't completely public yet .. cc @cloud-fan.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199529548
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -42,63 +38,27 @@ object DataSourceUtils {
 
   /**
* Verify if the schema is supported in datasource. This verification 
should be done
-   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
-   * in `FileFormat`.
-   *
-   * Unsupported data types of csv, json, orc, and parquet are as follows;
-   *  csv -> R/W: Interval, Null, Array, Map, Struct
-   *  json -> W: Interval
-   *  orc -> W: Interval, Null
-   *  parquet -> R/W: Interval, Null
+   * in a driver side.
*/
   private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
-def throwUnsupportedException(dataType: DataType): Unit = {
-  throw new UnsupportedOperationException(
-s"$format data source does not support ${dataType.simpleString} 
data type.")
-}
-
-def verifyType(dataType: DataType): Unit = dataType match {
-  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
-   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
-
-  // All the unsupported types for CSV
-  case _: NullType | _: CalendarIntervalType | _: StructType | _: 
ArrayType | _: MapType
-  if format.isInstanceOf[CSVFileFormat] =>
-throwUnsupportedException(dataType)
-
-  case st: StructType => st.foreach { f => verifyType(f.dataType) }
-
-  case ArrayType(elementType, _) => verifyType(elementType)
-
-  case MapType(keyType, valueType, _) =>
-verifyType(keyType)
-verifyType(valueType)
-
-  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
-
-  // Interval type not supported in all the write path
-  case _: CalendarIntervalType if !isReadPath =>
-throwUnsupportedException(dataType)
-
-  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
-  // for back-compatibility.
-  case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] 
||
-format.isInstanceOf[OrcFileFormat] =>
+def verifyType(dataType: DataType): Unit = {
+  if (!format.supportDataType(dataType, isReadPath)) {
+throw new UnsupportedOperationException(
+  s"$format data source does not support ${dataType.simpleString} 
data type.")
+  }
+  dataType match {
--- End diff --

I know .. but even if developers didn't read the internals and handle the arrays/maps/structs cases themselves, the code would still work.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199531608
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -42,63 +38,27 @@ object DataSourceUtils {
 
   /**
* Verify if the schema is supported in datasource. This verification 
should be done
-   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
-   * in `FileFormat`.
-   *
-   * Unsupported data types of csv, json, orc, and parquet are as follows;
-   *  csv -> R/W: Interval, Null, Array, Map, Struct
-   *  json -> W: Interval
-   *  orc -> W: Interval, Null
-   *  parquet -> R/W: Interval, Null
+   * in a driver side.
*/
   private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
-def throwUnsupportedException(dataType: DataType): Unit = {
-  throw new UnsupportedOperationException(
-s"$format data source does not support ${dataType.simpleString} 
data type.")
-}
-
-def verifyType(dataType: DataType): Unit = dataType match {
-  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
-   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
-
-  // All the unsupported types for CSV
-  case _: NullType | _: CalendarIntervalType | _: StructType | _: 
ArrayType | _: MapType
-  if format.isInstanceOf[CSVFileFormat] =>
-throwUnsupportedException(dataType)
-
-  case st: StructType => st.foreach { f => verifyType(f.dataType) }
-
-  case ArrayType(elementType, _) => verifyType(elementType)
-
-  case MapType(keyType, valueType, _) =>
-verifyType(keyType)
-verifyType(valueType)
-
-  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
-
-  // Interval type not supported in all the write path
-  case _: CalendarIntervalType if !isReadPath =>
-throwUnsupportedException(dataType)
-
-  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
-  // for back-compatibility.
-  case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] 
||
-format.isInstanceOf[OrcFileFormat] =>
+def verifyType(dataType: DataType): Unit = {
+  if (!format.supportDataType(dataType, isReadPath)) {
+throw new UnsupportedOperationException(
+  s"$format data source does not support ${dataType.simpleString} 
data type.")
+  }
+  dataType match {
--- End diff --

In that case, this code bit becomes rather obsolete .. Spark's dev API is already difficult enough for me to understand :-) .. Personally, I don't like to be too clever when it comes to API design.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199694805
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -42,63 +38,27 @@ object DataSourceUtils {
 
   /**
* Verify if the schema is supported in datasource. This verification 
should be done
-   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
-   * in `FileFormat`.
-   *
-   * Unsupported data types of csv, json, orc, and parquet are as follows;
-   *  csv -> R/W: Interval, Null, Array, Map, Struct
-   *  json -> W: Interval
-   *  orc -> W: Interval, Null
-   *  parquet -> R/W: Interval, Null
+   * in a driver side.
*/
   private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
-def throwUnsupportedException(dataType: DataType): Unit = {
-  throw new UnsupportedOperationException(
-s"$format data source does not support ${dataType.simpleString} 
data type.")
-}
-
-def verifyType(dataType: DataType): Unit = dataType match {
-  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
-   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
-
-  // All the unsupported types for CSV
-  case _: NullType | _: CalendarIntervalType | _: StructType | _: 
ArrayType | _: MapType
-  if format.isInstanceOf[CSVFileFormat] =>
-throwUnsupportedException(dataType)
-
-  case st: StructType => st.foreach { f => verifyType(f.dataType) }
-
-  case ArrayType(elementType, _) => verifyType(elementType)
-
-  case MapType(keyType, valueType, _) =>
-verifyType(keyType)
-verifyType(valueType)
-
-  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
-
-  // Interval type not supported in all the write path
-  case _: CalendarIntervalType if !isReadPath =>
-throwUnsupportedException(dataType)
-
-  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
-  // for back-compatibility.
-  case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] 
||
-format.isInstanceOf[OrcFileFormat] =>
+def verifyType(dataType: DataType): Unit = {
+  if (!format.supportDataType(dataType, isReadPath)) {
+throw new UnsupportedOperationException(
+  s"$format data source does not support ${dataType.simpleString} 
data type.")
+  }
+  dataType match {
--- End diff --

It's tricky to rely on 2 places to correctly determine the unsupported types. `format.supportDataType` should handle complex types itself, to make the code clearer and easier to maintain.
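Concretely, that would mean each implementation recurses into complex types itself. A sketch for an ORC-like format, based on the rules quoted above (interval and null tolerated on the read path only), assuming `org.apache.spark.sql.types._` is imported:

```scala
override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean =
  dataType match {
    // Interval and null are only tolerated on the read path (back-compatibility).
    case _: CalendarIntervalType | _: NullType => isReadPath
    // Complex types are handled here, not by the caller.
    case st: StructType => st.forall(f => supportDataType(f.dataType, isReadPath))
    case ArrayType(elementType, _) => supportDataType(elementType, isReadPath)
    case MapType(keyType, valueType, _) =>
      supportDataType(keyType, isReadPath) && supportDataType(valueType, isReadPath)
    case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
    case _ => true
  }
```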


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199695118
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
--- End diff --

Yea, it's not just "by default": we can't write out the interval type at all. This check is in `DataSource.planForWriting`.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199695311
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -152,6 +152,16 @@ trait FileFormat {
 }
   }
 
+  /**
+   * Returns whether this format supports the given [[DataType]] in read/write path.
+   *
+   * By default all data types are supported except [[CalendarIntervalType]] in write path.
+   */
+  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
--- End diff --

blacklist is easier, but whitelist is safer.


---




[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

2018-07-03 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21667#discussion_r199705148
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
 ---
@@ -42,63 +38,27 @@ object DataSourceUtils {
 
   /**
* Verify if the schema is supported in datasource. This verification 
should be done
-   * in a driver side, e.g., `prepareWrite`, `buildReader`, and 
`buildReaderWithPartitionValues`
-   * in `FileFormat`.
-   *
-   * Unsupported data types of csv, json, orc, and parquet are as follows;
-   *  csv -> R/W: Interval, Null, Array, Map, Struct
-   *  json -> W: Interval
-   *  orc -> W: Interval, Null
-   *  parquet -> R/W: Interval, Null
+   * in a driver side.
*/
   private def verifySchema(format: FileFormat, schema: StructType, 
isReadPath: Boolean): Unit = {
-def throwUnsupportedException(dataType: DataType): Unit = {
-  throw new UnsupportedOperationException(
-s"$format data source does not support ${dataType.simpleString} 
data type.")
-}
-
-def verifyType(dataType: DataType): Unit = dataType match {
-  case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
-   StringType | BinaryType | DateType | TimestampType | _: 
DecimalType =>
-
-  // All the unsupported types for CSV
-  case _: NullType | _: CalendarIntervalType | _: StructType | _: 
ArrayType | _: MapType
-  if format.isInstanceOf[CSVFileFormat] =>
-throwUnsupportedException(dataType)
-
-  case st: StructType => st.foreach { f => verifyType(f.dataType) }
-
-  case ArrayType(elementType, _) => verifyType(elementType)
-
-  case MapType(keyType, valueType, _) =>
-verifyType(keyType)
-verifyType(valueType)
-
-  case udt: UserDefinedType[_] => verifyType(udt.sqlType)
-
-  // Interval type not supported in all the write path
-  case _: CalendarIntervalType if !isReadPath =>
-throwUnsupportedException(dataType)
-
-  // JSON and ORC don't support an Interval type, but we pass it in 
read pass
-  // for back-compatibility.
-  case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] 
||
-format.isInstanceOf[OrcFileFormat] =>
+def verifyType(dataType: DataType): Unit = {
+  if (!format.supportDataType(dataType, isReadPath)) {
+throw new UnsupportedOperationException(
+  s"$format data source does not support ${dataType.simpleString} 
data type.")
+  }
+  dataType match {
--- End diff --

I see. I will update it.


---
