[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user MaxGekk closed the pull request at: https://github.com/apache/spark/pull/21472
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r194992611

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

@maropu @HyukjinKwon Please have a look at the PR: https://github.com/apache/spark/pull/21550
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192584018

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

If possible, I like @HyukjinKwon's approach. If I remember correctly, we keep the JSON schema format just for backward compatibility. In a future major release, I think we could possibly drop the support.
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192583925

--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql ---

```diff
@@ -25,6 +25,10 @@ select from_json('{"a":1}', 'a InvalidType');
 select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_json('{"a":1}', 'a INT', map('mode', 1));
 select from_json();
+-- from_json - schema in json format
+select from_json('{"a":1}', '{"type":"struct","fields":[{"name":"a","type":"integer", "nullable":true}]}');
+select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}');
+
```

--- End diff --

To make the output file changes smaller, can you add the new tests at the end of the file?
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192380680

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

I mean, like what we do on the Python side: https://github.com/apache/spark/blob/4f1e38649ebc7710850b7c40e6fb355775e7bb7f/python/pyspark/sql/types.py#L808-L814
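For reference, the Python helper linked above cascades through several parsers before giving up. A minimal Scala sketch of the same fallback idea follows; `parseSchema` is an illustrative helper name, not an actual Spark API:

```scala
import scala.util.control.NonFatal
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{DataType, StructType}

// Illustrative fallback chain: try the DDL table schema first ("a INT, b STRING"),
// then a bare DDL type string ("map<string,int>"), then the legacy JSON format.
def parseSchema(schema: String): DataType = {
  try {
    StructType.fromDDL(schema)
  } catch {
    case NonFatal(_) =>
      try {
        CatalystSqlParser.parseDataType(schema)
      } catch {
        case NonFatal(_) => DataType.fromJson(schema)
      }
  }
}
```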
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192379878

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

I mean, type string support by `CatalystSqlParser.parseDataType` (like `array<...>` or `struct<...>`) in `from_json`. This will support `struct` if I am not mistaken.
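If that support were wired in, type strings like the following would presumably be accepted; these are hypothetical queries assuming the suggested `parseDataType` support lands:

```sql
-- Hypothetical, assuming `CatalystSqlParser.parseDataType` is hooked into from_json:
select from_json('{"a":1}', 'struct<a:int>');
select from_json('[1, 2, 3]', 'array<int>');
```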
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192358871

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

> Shall we add the support with type itself with CatalystSqlParser.parseDataType too?

I will, but it won't fully solve the customer's problem.

> Also, are you able to use catalogString?

I just checked that:

```scala
val schema = MapType(StringType, IntegerType).catalogString
val ds = spark.sql(
  s"""
     |select from_json('{"a":1}', '$schema')
   """.stripMargin)
ds.show()
```

and got this one:

```
extraneous input '<' expecting {'SELECT', 'FROM', ...}(line 1, pos 3)

== SQL ==
map<string,int>
---^^^
; line 2 pos 7
```

The same with `val schema = new StructType().add("a", IntegerType).catalogString`:

```
== SQL ==
struct<a:int>
--^^^
; line 2 pos 7
org.apache.spark.sql.AnalysisException
```

Am I doing something wrong?
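The failure above is consistent with the current code path going through `CatalystSqlParser.parseTableSchema`, which expects column definitions rather than a bare type string. A small illustration, with return values shown as comments:

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// parseTableSchema expects a column list, so a bare type string is rejected:
CatalystSqlParser.parseTableSchema("a INT")               // StructType(StructField(a,IntegerType,true))
// CatalystSqlParser.parseTableSchema("map<string,int>")  // throws ParseException
CatalystSqlParser.parseDataType("map<string,int>")        // MapType(StringType,IntegerType,true)
```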
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192342580

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

> Schema in DDL supports only `StructType` as root types. It is not possible to specify `MapType` like in the test:

Shall we add support for the type itself with `CatalystSqlParser.parseDataType` too? Also, are you able to use `catalogString`?
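For context, `catalogString` on those types yields bare DDL type strings, which is why the question above matters:

```scala
import org.apache.spark.sql.types._

MapType(StringType, IntegerType).catalogString        // "map<string,int>"
new StructType().add("a", IntegerType).catalogString  // "struct<a:int>"
```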
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192339647

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

> How do they get the metadata ...

The metadata is stored together with the data in a distributed file system and loaded by the standard facilities of the language.

> and how do they insert it into SQL?

The SQL statements are formed programmatically as strings, and the loaded schemas are inserted at particular positions in the string (you can think of it as quasiquotes in Scala); see the sketch after this message. The formed SQL statements are sent to Spark via JDBC.

> Is that the only way to do it?

It is probably possible to convert schemas from JSON format to DDL format, but:
- it requires much more effort and time than just modifying the 5 lines proposed in the PR
- the DDL schema format supports only `StructType` as the root type. It is not possible to specify `MapType` like in the test: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala#L330-L345
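A minimal sketch of the workflow described above; the helper names, column, and table are hypothetical, and the schema JSON would come from the customer's metadata store:

```scala
// Hypothetical illustration: a schema stored as JSON alongside the data
// is spliced into a SQL string and sent to Spark (e.g. over JDBC).
val schemaJson: String =
  """{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true}]}"""

// Schema literal inserted at a particular position, quasiquote-style:
val statement = s"""select from_json(payload, '$schemaJson') from events"""
// `statement` is then submitted through a JDBC connection to Spark.
```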
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192323643

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

Usually they should be consistent, but we don't necessarily have to add new support for the obsolete functionality just for consistency. I'm not sure how common it is to write a JSON literal as a schema via SQL. How do they get the metadata, and how do they insert it into SQL? Is that the only way to do it?
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192309953

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

I believe we should support the JSON format because:
- The functionality of SQL and the Scala (and other language) DSLs should be equal; otherwise we push users towards the Scala DSL because SQL has fewer features.
- The feature allows saving/restoring a schema in JSON format. The customer's use case is to have data in JSON format plus meta info, including the schema, in JSON format too. A schema in JSON format gives them more opportunities for processing it programmatically.
- For now, the JSON format gives us more flexibility and allows `MapType` (and `ArrayType`) as the root type for the result of `from_json`.
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192293089

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```diff
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
```

--- End diff --

I don't think we should support the JSON format here. A DDL-formatted schema is preferred. JSON in functions.scala is supported for backward compatibility because the SQL functions weren't added first.
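For comparison, the preferred DDL-formatted schema is a column list rather than a JSON document; this is existing `from_json` behavior, not new functionality:

```sql
-- DDL-formatted schema: column definitions instead of a JSON schema document.
select from_json('{"a":1, "b":"x"}', 'a INT, b STRING');
```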
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21472

[SPARK-24445][SQL] Schema in json format for from_json in SQL

## What changes were proposed in this pull request?

In the PR, I propose to support schemas in JSON format for the `from_json()` function in SQL, as it has already been implemented in the Scala DSL, for example [there](https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229):

```scala
val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}
```

The changes will allow specifying `MapType` in SQL, which is impossible at the moment:

```sql
select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}')
```

## How was this patch tested?

Added a couple of test cases to `json-functions.sql`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 from_json-sql-schema

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21472.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21472

commit 778162074cce7046f5f48d02780c8bac251fe96b
Author: Maxim Gekk
Date: 2018-05-31T18:49:23Z

    SQL tests for schema in json format

commit 139ef7e98c27c5f24d7a25fec8789309657cb5d3
Author: Maxim Gekk
Date: 2018-05-31T19:42:27Z

    Support schema in json format
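To illustrate the JSON schema format involved, `DataType.fromJson` round-trips the string from the example above; results are shown as comments:

```scala
import org.apache.spark.sql.types._

// The JSON schema string from the example above parses to a MapType:
val dt = DataType.fromJson(
  """{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}""")
// dt == MapType(StringType, IntegerType, valueContainsNull = false)

// And a MapType serializes back to the same JSON shape:
MapType(StringType, IntegerType, valueContainsNull = false).json
```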