[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19231


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r139073124
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -768,31 +767,30 @@ object JdbcUtils extends Logging {
   }
 
   /**
-   * Parses the user specified customSchema option value to DataFrame schema,
-   * and returns it if it's all columns are equals to default schema's.
+   * Parses the user specified customSchema option value to DataFrame schema, and
+   * returns a schema that is replaced by the custom schema's dataType if column name is matched.
    */
   def getCustomSchema(
       tableSchema: StructType,
       customSchema: String,
       nameEquality: Resolver): StructType = {
-    val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
+    if (null != customSchema && customSchema.nonEmpty) {
+      val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
 
-    SchemaUtils.checkColumnNameDuplication(
-      userSchema.map(_.name), "in the customSchema option value", nameEquality)
-
-    val colNames = tableSchema.fieldNames.mkString(",")
-    val errorMsg = s"Please provide all the columns, all columns are: $colNames"
-    if (userSchema.size != tableSchema.size) {
-      throw new AnalysisException(errorMsg)
-    }
+      SchemaUtils.checkColumnNameDuplication(
+        userSchema.map(_.name), "in the customSchema option value", nameEquality)
 
-    // This is resolved by names, only check the column names.
-    userSchema.fieldNames.foreach { col =>
-      tableSchema.find(f => nameEquality(f.name, col)).getOrElse {
-        throw new AnalysisException(errorMsg)
+      // This is resolved by names, use the custom field dataType to replace the default dataType.
+      val newSchema = tableSchema.map { col =>
+        userSchema.find(f => nameEquality(f.name, col.name)) match {
+          case Some(c) => col.copy(dataType = c.dataType)
--- End diff --

Yes, we should keep the original nullability. 
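
For context, a minimal sketch of why copying only `dataType` keeps the original nullability (the `c1`/`c2` fields here are hypothetical, not from this PR):

```
import org.apache.spark.sql.types._

object NullabilityDemo extends App {
  // Default schema inferred from the JDBC table: c1 is NOT NULL.
  val tableSchema = StructType(Seq(
    StructField("c1", IntegerType, nullable = false),
    StructField("c2", StringType, nullable = true)))

  // User-supplied override for c1's type only.
  val userField = StructField("c1", DecimalType(38, 0), nullable = true)

  // col.copy(dataType = ...) changes only the type; nullable = false survives.
  val merged = tableSchema.map { col =>
    if (col.name == userField.name) col.copy(dataType = userField.dataType) else col
  }
  merged.foreach(f => println(s"${f.name}: ${f.dataType.simpleString}, nullable=${f.nullable}"))
  // c1: decimal(38,0), nullable=false
  // c2: string, nullable=true
}
```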


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-14 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r138950451
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -775,24 +775,23 @@ object JdbcUtils extends Logging {
       tableSchema: StructType,
       customSchema: String,
       nameEquality: Resolver): StructType = {
-    val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
+    if (null != customSchema && customSchema.nonEmpty) {
+      val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
 
-    SchemaUtils.checkColumnNameDuplication(
-      userSchema.map(_.name), "in the customSchema option value", nameEquality)
-
-    val colNames = tableSchema.fieldNames.mkString(",")
-    val errorMsg = s"Please provide all the columns, all columns are: $colNames"
-    if (userSchema.size != tableSchema.size) {
-      throw new AnalysisException(errorMsg)
-    }
+      SchemaUtils.checkColumnNameDuplication(
+        userSchema.map(_.name), "in the customSchema option value", nameEquality)
 
-    // This is resolved by names, only check the column names.
-    userSchema.fieldNames.foreach { col =>
-      tableSchema.find(f => nameEquality(f.name, col)).getOrElse {
-        throw new AnalysisException(errorMsg)
+      // This is resolved by names, use the custom field dataType to replace the default dataType.
+      val newSchema = tableSchema.map { col =>
+        userSchema.find(f => nameEquality(f.name, col.name)) match {
+          case Some(c) => col.copy(dataType = c.dataType, metadata = Metadata.empty)
--- End diff --

Because [`scale`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L305) is used to infer column types, we shouldn't remove it:
https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L32
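
For context, a paraphrased sketch of the dialect hook involved (not the exact OracleDialect source; `SomeOracleLikeDialect` and the simplified Decimal mapping are illustrative):

```
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

case object SomeOracleLikeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.NUMERIC) {
      // The "scale" stored in the column metadata drives the Decimal mapping;
      // wiping the metadata would lose this information.
      val scale = if (md != null) md.build().getLong("scale") else 0L
      Some(DecimalType(DecimalType.MAX_PRECISION, scale.toInt))
    } else {
      None
    }
  }
}
```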


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r138948991
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -775,24 +775,23 @@ object JdbcUtils extends Logging {
       tableSchema: StructType,
       customSchema: String,
       nameEquality: Resolver): StructType = {
-    val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
+    if (null != customSchema && customSchema.nonEmpty) {
+      val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
 
-    SchemaUtils.checkColumnNameDuplication(
-      userSchema.map(_.name), "in the customSchema option value", nameEquality)
-
-    val colNames = tableSchema.fieldNames.mkString(",")
-    val errorMsg = s"Please provide all the columns, all columns are: $colNames"
-    if (userSchema.size != tableSchema.size) {
-      throw new AnalysisException(errorMsg)
-    }
+      SchemaUtils.checkColumnNameDuplication(
+        userSchema.map(_.name), "in the customSchema option value", nameEquality)
 
-    // This is resolved by names, only check the column names.
-    userSchema.fieldNames.foreach { col =>
-      tableSchema.find(f => nameEquality(f.name, col)).getOrElse {
-        throw new AnalysisException(errorMsg)
+      // This is resolved by names, use the custom field dataType to replace the default dataType.
+      val newSchema = tableSchema.map { col =>
+        userSchema.find(f => nameEquality(f.name, col.name)) match {
+          case Some(c) => col.copy(dataType = c.dataType, metadata = Metadata.empty)
--- End diff --

Why not change the following line
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L309
to
```
fields(i) = StructField(columnName, columnType, nullable)
```
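
A paraphrased sketch of the kind of schema-building loop being referenced, to show what the suggestion changes (simplified and hypothetical; the real code also maps the JDBC type and builds per-column metadata):

```
import java.sql.{ResultSet, ResultSetMetaData}
import org.apache.spark.sql.types._

object SchemaFromResultSet {
  def apply(rs: ResultSet): StructType = {
    val rsmd = rs.getMetaData
    val ncols = rsmd.getColumnCount
    val fields = new Array[StructField](ncols)
    var i = 0
    while (i < ncols) {
      val columnName = rsmd.getColumnLabel(i + 1)
      val nullable = rsmd.isNullable(i + 1) != ResultSetMetaData.columnNoNulls
      val columnType: DataType = StringType // placeholder for the real type mapping
      // The suggestion: build the field without attaching the metadata.
      fields(i) = StructField(columnName, columnType, nullable)
      i += 1
    }
    new StructType(fields)
  }
}
```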


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-14 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r138948476
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---
@@ -993,7 +996,10 @@ class JDBCSuite extends SparkFunSuite
     Seq(StructField("NAME", StringType, true), StructField("THEID", IntegerType, true)))
   val df = sql("select * from people_view")
   assert(df.schema.size === 2)
-  assert(df.schema === schema)
--- End diff --

We shouldn't change it, because `scale` is used to infer column types: 
https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L32



---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-14 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r138946335
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -775,24 +775,23 @@ object JdbcUtils extends Logging {
       tableSchema: StructType,
       customSchema: String,
       nameEquality: Resolver): StructType = {
-    val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
+    if (null != customSchema && customSchema.nonEmpty) {
+      val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
 
-    SchemaUtils.checkColumnNameDuplication(
-      userSchema.map(_.name), "in the customSchema option value", nameEquality)
-
-    val colNames = tableSchema.fieldNames.mkString(",")
-    val errorMsg = s"Please provide all the columns, all columns are: $colNames"
-    if (userSchema.size != tableSchema.size) {
-      throw new AnalysisException(errorMsg)
-    }
+      SchemaUtils.checkColumnNameDuplication(
+        userSchema.map(_.name), "in the customSchema option value", nameEquality)
 
-    // This is resolved by names, only check the column names.
-    userSchema.fieldNames.foreach { col =>
-      tableSchema.find(f => nameEquality(f.name, col)).getOrElse {
-        throw new AnalysisException(errorMsg)
+      // This is resolved by names, use the custom field dataType to replace the default dataType.
+      val newSchema = tableSchema.map { col =>
+        userSchema.find(f => nameEquality(f.name, col.name)) match {
+          case Some(c) => col.copy(dataType = c.dataType, metadata = Metadata.empty)
--- End diff --

Reset metadata to empty, otherwise it is not equal to the schema generated by `CatalystSqlParser.parseTableSchema`.
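
A minimal illustration of that point, using a hypothetical column `id` (`StructField` is a case class, so equality includes the metadata field):

```
import org.apache.spark.sql.types._

object MetadataEqualityDemo extends App {
  // Field as produced by JDBC schema inference, carrying "scale" metadata.
  val inferred = StructField("id", IntegerType, nullable = true,
    new MetadataBuilder().putLong("scale", 0L).build())

  // Field as produced by CatalystSqlParser.parseTableSchema: Metadata.empty.
  val parsed = StructField("id", IntegerType, nullable = true)

  println(inferred == parsed)                                 // false
  println(inferred.copy(metadata = Metadata.empty) == parsed) // true
}
```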


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r138800677
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---
@@ -993,7 +996,10 @@ class JDBCSuite extends SparkFunSuite
     Seq(StructField("NAME", StringType, true), StructField("THEID", IntegerType, true)))
   val df = sql("select * from people_view")
   assert(df.schema.size === 2)
-  assert(df.schema === schema)
--- End diff --

revert it back. 

Change the following line
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L309
to
```
fields(i) = StructField(columnName, columnType, nullable)
```

You also need to update some test cases due to the above change, I think.


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r138797882
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1333,7 +1333,7 @@ the following case-insensitive options:
 
  customSchema
 
- The custom schema to use for reading data from JDBC connectors. For example, "id DECIMAL(38, 0), name STRING"). The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. This option applies only to reading.
+ The custom schema to use for reading data from JDBC connectors. For example, "id DECIMAL(38, 0), name STRING". You can also specify partial fields, others use default values. For example, "id DECIMAL(38, 0)". The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. This option applies only to reading.
--- End diff --

`others` -> `and the others use the default type mapping`
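
For readers landing here, a minimal usage sketch of the option under discussion (the connection URL, table name, and column name are placeholders, not from this PR):

```
import org.apache.spark.sql.SparkSession

object CustomSchemaReadDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("customSchema demo").getOrCreate()
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db") // placeholder URL
      .option("dbtable", "people")                     // placeholder table
      // Override only "id"; every other column keeps the default type mapping.
      .option("customSchema", "id DECIMAL(38, 0)")
      .load()
    df.printSchema()
  }
}
```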


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-13 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/19231

[SPARK-22002][SQL] Read JDBC table use custom schema support specify partial fields.

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/18266 added a new feature to support reading a JDBC table with a custom schema, but all the fields must be specified. For convenience, this PR supports specifying only partial fields.

## How was this patch tested?
Unit tests.
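
A hedged sketch of the behavior those tests cover (illustrative names; the real tests live in JDBCSuite, and the case-insensitive match below stands in for Spark's `Resolver`):

```
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types._

object PartialSchemaCheck extends App {
  // Default schema inferred from the JDBC table.
  val tableSchema = StructType(Seq(
    StructField("id", IntegerType), StructField("name", StringType)))

  // Only one of the two columns is overridden.
  val userSchema = CatalystSqlParser.parseTableSchema("id DECIMAL(38, 0)")

  // Replace the dataType where names match; unmatched columns keep defaults.
  val merged = StructType(tableSchema.map { col =>
    userSchema.find(_.name.equalsIgnoreCase(col.name))
      .map(c => col.copy(dataType = c.dataType))
      .getOrElse(col)
  })

  assert(merged("id").dataType == DecimalType(38, 0))
  assert(merged("name").dataType == StringType)
  println(merged.simpleString) // struct<id:decimal(38,0),name:string>
}
```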

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-22002

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19231.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19231


commit 9e7a8a471835d5e93a729c15d166451e79567447
Author: Yuming Wang 
Date:   2017-09-14T04:26:46Z

Read JDBC table use custom schema support specify partial fields.




---
