[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19231

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19231#discussion_r139073124

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
@@ -768,31 +767,30 @@ object JdbcUtils extends Logging {
   }

   /**
-   * Parses the user specified customSchema option value to DataFrame schema,
-   * and returns it if it's all columns are equals to default schema's.
+   * Parses the user specified customSchema option value to DataFrame schema, and
+   * returns a schema that is replaced by the custom schema's dataType if column name is matched.
    */
   def getCustomSchema(
       tableSchema: StructType,
       customSchema: String,
       nameEquality: Resolver): StructType = {
-    val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
+    if (null != customSchema && customSchema.nonEmpty) {
+      val userSchema = CatalystSqlParser.parseTableSchema(customSchema)

-    SchemaUtils.checkColumnNameDuplication(
-      userSchema.map(_.name), "in the customSchema option value", nameEquality)
-
-    val colNames = tableSchema.fieldNames.mkString(",")
-    val errorMsg = s"Please provide all the columns, all columns are: $colNames"
-    if (userSchema.size != tableSchema.size) {
-      throw new AnalysisException(errorMsg)
-    }
+      SchemaUtils.checkColumnNameDuplication(
+        userSchema.map(_.name), "in the customSchema option value", nameEquality)

-    // This is resolved by names, only check the column names.
-    userSchema.fieldNames.foreach { col =>
-      tableSchema.find(f => nameEquality(f.name, col)).getOrElse {
-        throw new AnalysisException(errorMsg)
+      // This is resolved by names, use the custom filed dataType to replace the default dataType.
+      val newSchema = tableSchema.map { col =>
+        userSchema.find(f => nameEquality(f.name, col.name)) match {
+          case Some(c) => col.copy(dataType = c.dataType)

Yes, we should keep the original nullability.
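To make the behavior under discussion concrete, here is a minimal, Spark-free sketch of the merge the reviewers settle on: for every column of the JDBC-inferred schema, substitute the dataType from the user-supplied customSchema when the names match, but keep the column's original nullability. `Field` and `mergeSchemas` are stand-ins invented for this sketch (not Spark's `StructField` or the actual `getCustomSchema`), and `equalsIgnoreCase` approximates a case-insensitive `Resolver`.

```scala
// Sketch of merging a default JDBC schema with a partial custom schema by name.
case class Field(name: String, dataType: String, nullable: Boolean)

def mergeSchemas(tableSchema: Seq[Field], userSchema: Seq[Field]): Seq[Field] =
  tableSchema.map { col =>
    userSchema.find(f => f.name.equalsIgnoreCase(col.name)) match {
      case Some(c) => col.copy(dataType = c.dataType) // keep original nullability
      case None    => col                             // no override: keep default type mapping
    }
  }

val table  = Seq(Field("ID", "LONG", nullable = false), Field("NAME", "STRING", nullable = true))
val custom = Seq(Field("id", "DECIMAL(38,0)", nullable = true))
val merged = mergeSchemas(table, custom)
// ID picks up DECIMAL(38,0) but stays non-nullable; NAME is untouched
```

Note how the nullability of the custom field is deliberately ignored, matching gatorsmile's point above that the original nullability should be kept.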
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19231#discussion_r138950451

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
@@ -775,24 +775,23 @@ object JdbcUtils extends Logging {
       tableSchema: StructType,
       customSchema: String,
       nameEquality: Resolver): StructType = {
-    val userSchema = CatalystSqlParser.parseTableSchema(customSchema)
+    if (null != customSchema && customSchema.nonEmpty) {
+      val userSchema = CatalystSqlParser.parseTableSchema(customSchema)

-    SchemaUtils.checkColumnNameDuplication(
-      userSchema.map(_.name), "in the customSchema option value", nameEquality)
-
-    val colNames = tableSchema.fieldNames.mkString(",")
-    val errorMsg = s"Please provide all the columns, all columns are: $colNames"
-    if (userSchema.size != tableSchema.size) {
-      throw new AnalysisException(errorMsg)
-    }
+      SchemaUtils.checkColumnNameDuplication(
+        userSchema.map(_.name), "in the customSchema option value", nameEquality)

-    // This is resolved by names, only check the column names.
-    userSchema.fieldNames.foreach { col =>
-      tableSchema.find(f => nameEquality(f.name, col)).getOrElse {
-        throw new AnalysisException(errorMsg)
+      // This is resolved by names, use the custom filed dataType to replace the default dateType.
+      val newSchema = tableSchema.map { col =>
+        userSchema.find(f => nameEquality(f.name, col.name)) match {
+          case Some(c) => col.copy(dataType = c.dataType, metadata = Metadata.empty)

Because `scale` (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L305) is used to infer column types, we shouldn't remove it:
https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L32
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19231#discussion_r138948991

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala, on the added line:
+          case Some(c) => col.copy(dataType = c.dataType, metadata = Metadata.empty)

Why not change the following line
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L309
to

```
fields(i) = StructField(columnName, columnType, nullable)
```
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19231#discussion_r138948476

Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
@@ -993,7 +996,10 @@ class JDBCSuite extends SparkFunSuite
       Seq(StructField("NAME", StringType, true), StructField("THEID", IntegerType, true)))
     val df = sql("select * from people_view")
     assert(df.schema.size === 2)
-    assert(df.schema === schema)

We shouldn't change it, because `scale` is used to infer column types:
https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L32
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19231#discussion_r138946335

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala, on the added line:
+          case Some(c) => col.copy(dataType = c.dataType, metadata = Metadata.empty)

Reset metadata to empty, otherwise it is not equal to the schema generated by `CatalystSqlParser.parseTableSchema`.
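The equality problem wangyum describes can be illustrated without Spark. In the toy sketch below, `Field` stands in for Spark's `StructField` (it is not the real class): two fields with identical name, type, and nullability still compare unequal when their metadata maps differ, so a JDBC-inferred field carrying, say, a scale entry never equals the parser-produced field with empty metadata unless the metadata is reset.

```scala
// Why resetting metadata makes the schemas compare equal (illustrative only).
case class Field(name: String, dataType: String, nullable: Boolean, metadata: Map[String, String])

val fromJdbc   = Field("ID", "DECIMAL(38,0)", false, Map("scale" -> "0")) // inferred from the database
val fromParser = Field("ID", "DECIMAL(38,0)", false, Map.empty)           // parsed from customSchema

val equalAsIs       = fromJdbc == fromParser                            // metadata differs
val equalAfterReset = fromJdbc.copy(metadata = Map.empty) == fromParser // metadata cleared
```

This is the trade-off the thread is weighing: resetting metadata makes the comparison pass, but it also throws away the scale information that the dialects use for type inference.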
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19231#discussion_r138800677

Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala, on the removed line:
-    assert(df.schema === schema)

Revert it back. Change the following line
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L309
to

```
fields(i) = StructField(columnName, columnType, nullable)
```

You also need to update some test cases due to the above change, I think.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19231#discussion_r138797882

Diff: docs/sql-programming-guide.md
@@ -1333,7 +1333,7 @@ the following case-insensitive options:
   customSchema
-  The custom schema to use for reading data from JDBC connectors. For example, "id DECIMAL(38, 0), name STRING". The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. This option applies only to reading.
+  The custom schema to use for reading data from JDBC connectors. For example, "id DECIMAL(38, 0), name STRING". You can also specify partial fields, others use default values. For example, "id DECIMAL(38, 0)". The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. This option applies only to reading.

`others` -> `and the others use the default type mapping`
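For readers unfamiliar with the option value being documented: it is a comma-separated list of "name TYPE" pairs, where a type such as DECIMAL(38, 0) may itself contain a comma inside parentheses. The toy parser below (plain Scala, not Spark's actual `CatalystSqlParser.parseTableSchema`, and `parseCustomSchema` is a name invented for this sketch) illustrates the format, including the partial-fields case this PR adds.

```scala
// Toy illustration of the customSchema option value format:
// split on top-level commas only, then split each field at its first space.
def parseCustomSchema(s: String): Seq[(String, String)] = {
  val parts = scala.collection.mutable.Buffer.empty[String]
  var depth = 0
  val sb = new StringBuilder
  s.foreach {
    case '(' => depth += 1; sb += '('
    case ')' => depth -= 1; sb += ')'
    case ',' if depth == 0 => parts += sb.toString; sb.clear() // top-level comma: field boundary
    case c => sb += c
  }
  if (sb.nonEmpty) parts += sb.toString
  parts.toSeq.map(_.trim).map { f =>
    val i = f.indexOf(' ')
    (f.take(i), f.drop(i + 1).trim) // (column name, Spark SQL type string)
  }
}

parseCustomSchema("id DECIMAL(38, 0), name STRING")
// Seq(("id", "DECIMAL(38, 0)"), ("name", "STRING"))
```

With the change in this PR, a value like "id DECIMAL(38, 0)" that names only one column is also valid; the unnamed columns keep their default type mapping.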
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/19231 [SPARK-22002][SQL] Read JDBC table use custom schema support specify partial fields.

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/18266 added a feature to read a JDBC table using a custom schema, but the custom schema must currently list all of the table's columns. For convenience, this PR allows specifying only a subset of the columns; the rest keep their default type mapping.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-22002

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19231.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19231

commit 9e7a8a471835d5e93a729c15d166451e79567447
Author: Yuming Wang
Date: 2017-09-14T04:26:46Z

    Read JDBC table use custom schema support specify partial fields.