[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22442

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r220407196

Diff: python/pyspark/sql/functions.py

```
@@ -2328,11 +2328,14 @@ def to_json(col, options={}):
 @ignore_unicode_prefix
 @since(2.4)
-def schema_of_json(col):
+def schema_of_json(col, options={}):
     """
     Parses a column containing a JSON string and infers its schema in DDL format.

     :param col: string column in json format
+    :param options: options to control parsing. accepts the same options as the JSON datasource
+
+    .. note:: Since Spark 2.5, it accepts options to control schema inferring.
```

End diff

Let's convert this to:

```
.. versionchanged:: 2.5
   it accepts `options` parameter to control schema inferring.
```
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r219403620

Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala

```
@@ -3611,6 +3611,20 @@ object functions {
   */
 def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))

+/**
+ * Parses a column containing a JSON string and infers its schema using options.
+ *
+ * @param e a string column containing JSON data.
+ * @param options JSON datasource options that control JSON parsing and type inference.
```

End diff

As I see, we don't fail. A simple example: if `multiLine` is enabled, `lineSep` is ignored. There are other examples.
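The interaction described above (once `multiLine` is enabled, `lineSep` is silently ignored) can be illustrated with a toy record splitter. This is a minimal stdlib Python sketch of the idea only, not Spark's JSON reader; `split_records` is a hypothetical helper name.

```python
import json

def split_records(text, multiLine=False, lineSep="\n"):
    """Toy record splitter illustrating the option interaction:
    once multiLine is enabled, lineSep plays no role at all."""
    if multiLine:
        # The whole input is one JSON document; lineSep is ignored.
        return [json.loads(text)]
    # JSON Lines mode: one document per lineSep-separated chunk.
    return [json.loads(chunk) for chunk in text.split(lineSep) if chunk.strip()]

# lineSep is honored in JSON Lines mode...
print(split_records('{"a": 1};{"a": 2}', lineSep=";"))
# ...but has no effect once multiLine is set.
print(split_records('{"a": [1,\n 2]}', multiLine=True, lineSep=";"))
```

No error is raised for the inapplicable `lineSep` in the second call, which mirrors the behavior being discussed.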
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r219297029

Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala

```
@@ -3611,6 +3611,20 @@ object functions {
   */
 def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))

+/**
+ * Parses a column containing a JSON string and infers its schema using options.
+ *
+ * @param e a string column containing JSON data.
+ * @param options JSON datasource options that control JSON parsing and type inference.
```

End diff

Do we fail if users currently specify an option in `DataFrameReader` that doesn't apply? If we don't, I wouldn't fail here.
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r218801255

Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala

```
@@ -3611,6 +3611,20 @@ object functions {
   */
 def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))

+/**
+ * Parses a column containing a JSON string and infers its schema using options.
+ *
+ * @param e a string column containing JSON data.
+ * @param options JSON datasource options that control JSON parsing and type inference.
```

End diff

> Silently ignoring provided options is worse I guess ...

What about just writing an error to the logs? Even now, some options passed to `from_json` are silently ignored, for example `columnNameOfCorruptRecord`, `compression`, `mode`, `samplingRatio`, etc.
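The log-a-warning alternative floated above can be sketched in a few lines. This is a hypothetical stdlib Python illustration, not Spark code; the whitelist below is invented for the example (the real set would come from the list in SPARK-25447), and `effective_options` is a made-up helper name.

```python
import logging

logger = logging.getLogger("schema_of_json")

# Hypothetical whitelist of options that affect schema inference;
# anything else is warned about and dropped rather than raising.
INFERENCE_OPTIONS = {"primitivesAsString", "prefersDecimal",
                     "allowNumericLeadingZeros", "multiLine", "lineSep"}

def effective_options(options):
    """Keep only options relevant to inference; warn about (rather than
    fail on) the rest, mirroring how from_json drops them today."""
    kept = {}
    for key, value in options.items():
        if key in INFERENCE_OPTIONS:
            kept[key] = value
        else:
            logger.warning("Option '%s' has no effect on schema inference", key)
    return kept
```

Compared with throwing an exception, this keeps the `from_json`/`schema_of_json` call sites free of option filtering while still surfacing the mistake.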
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r218778695

Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala

```
@@ -3611,6 +3611,20 @@ object functions {
   */
 def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))

+/**
+ * Parses a column containing a JSON string and infers its schema using options.
+ *
+ * @param e a string column containing JSON data.
+ * @param options JSON datasource options that control JSON parsing and type inference.
```

End diff

But people probably wouldn't take a look at a ticket, and duplicated documentation is not a good idea either. Silently ignoring provided options is worse, I guess - we should probably throw an exception for `from_json` too in Spark 3.0. I thought we were going to do something similar for parse mode as well - the `DROPMALFORMED` one throws an exception?
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r218748820

Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala

```
@@ -3611,6 +3611,20 @@ object functions {
   */
 def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))

+/**
+ * Parses a column containing a JSON string and infers its schema using options.
+ *
+ * @param e a string column containing JSON data.
+ * @param options JSON datasource options that control JSON parsing and type inference.
```

End diff

I don't think it is a good idea to throw an exception in this case. Let's look at how the function could be used:

```
from_json('json_col, schema_of_json(, options), options)
```

Forcing users to filter options before passing them to `schema_of_json` is inconvenient from my point of view.
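The call pattern above - one options map shared by the inference step and the parsing step - looks like this in miniature. This is a stdlib Python sketch with made-up helper names (`infer_schema`, `parse_json`); it mimics the shape of the Spark API, not its implementation.

```python
import json

def infer_schema(doc, options=None):
    """Toy schema inference: map each top-level field to a type name,
    honoring a primitivesAsString-style option."""
    opts = options or {}
    as_string = opts.get("primitivesAsString") == "true"

    def type_of(value):
        if isinstance(value, bool):
            return "boolean"
        if isinstance(value, (int, float)):
            return "string" if as_string else "bigint"
        return "string"

    return {k: type_of(v) for k, v in json.loads(doc).items()}

def parse_json(doc, schema, options=None):
    """Toy from_json: keep only the fields the schema mentions."""
    return {k: v for k, v in json.loads(doc).items() if k in schema}

# The same options map flows into both calls, as in the example above.
options = {"primitivesAsString": "true"}
doc = '{"c1": 1, "c2": "x"}'
row = parse_json(doc, infer_schema(doc, options), options)
```

If `infer_schema` rejected options that only `parse_json` understands, the caller would have to split the map by hand before every such call - the inconvenience the comment describes.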
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r218642844

Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql

```
@@ -56,3 +56,8 @@
 select from_json('[{"a": 1}, 2]', 'array>');
 select to_json(array('1', '2', '3'));
 select to_json(array(array(1, 2, 3), array(4)));
+-- infer schema of json literal using options
+select schema_of_json('{"c1":1}', map('primitivesAsString', 'true'));
+select schema_of_json('{"c1":01, "c2":0.1}', map('allowNumericLeadingZeros', 'true', 'prefersDecimal', 'true'));
+
```

End diff

nit: unneeded newline
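The two new SQL tests exercise `primitivesAsString`, `allowNumericLeadingZeros`, and `prefersDecimal`. A rough stdlib Python sketch of how such options could steer the inferred type of a single JSON scalar (this mimics the idea only; it is not Spark's inference code, and `infer_scalar_type` is a hypothetical name):

```python
import json

def infer_scalar_type(literal, allowNumericLeadingZeros=False, prefersDecimal=False):
    """Infer a type name for one JSON scalar literal, toy-style."""
    stripped = literal.lstrip("-")
    if stripped != "0" and stripped.startswith("0") and "." not in stripped:
        # "01" is not standard JSON; without the option it stays a string.
        return "bigint" if allowNumericLeadingZeros else "string"
    try:
        value = json.loads(literal)
    except ValueError:
        return "string"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "decimal" if prefersDecimal else "double"
    return "string"
```

With the defaults, `"01"` falls back to string and `"0.1"` infers as double; the two options flip those results to bigint and decimal respectively, which is what the SQL tests assert at the column level.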
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r218642815

Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala

```
@@ -3611,6 +3611,20 @@ object functions {
   */
 def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))

+/**
+ * Parses a column containing a JSON string and infers its schema using options.
+ *
+ * @param e a string column containing JSON data.
+ * @param options JSON datasource options that control JSON parsing and type inference.
```

End diff

How about we leave a link, whitelist the effective options, and throw an exception for the rest?
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r218545335

Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala

```
@@ -3611,6 +3611,20 @@ object functions {
   */
 def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))

+/**
+ * Parses a column containing a JSON string and infers its schema using options.
+ *
+ * @param e a string column containing JSON data.
+ * @param options JSON datasource options that control JSON parsing and type inference.
```

End diff

Sure, but not all options can affect schema inference. @rxin Maybe it makes sense to list the options I pointed out in the ticket: https://issues.apache.org/jira/browse/SPARK-25447?
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22442#discussion_r218250393

Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala

```
@@ -3611,6 +3611,20 @@ object functions {
   */
 def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))

+/**
+ * Parses a column containing a JSON string and infers its schema using options.
+ *
+ * @param e a string column containing JSON data.
+ * @param options JSON datasource options that control JSON parsing and type inference.
```

End diff

Maybe you can say to refer to `DataFrameReader.json` for the list of options.
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/22442

[SPARK-25447][SQL] Support JSON options by schema_of_json()

## What changes were proposed in this pull request?

In the PR, I propose to extend the `schema_of_json()` function to accept JSON options, since they can affect schema inference. The purpose is to support the same options that `from_json` can use during schema inference.

## How was this patch tested?

Added SQL, Python and Scala tests (`JsonExpressionsSuite` and `JsonFunctionsSuite`) that check the JSON options are used.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 schema_of_json-options

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22442.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22442

commit 6a9ec940af1714a603d71f995201c4753b0e06c4
Author: Maxim Gekk
Date: 2018-09-16T21:06:27Z

    Accept options in schema_of_json

commit 68e1438e9bfdf9c4ab2cc68c251308f0463df4ef
Author: Maxim Gekk
Date: 2018-09-16T21:29:59Z

    Fix examples

commit 3365e086f662da859dcd74c0973c7331925c5bcd
Author: Maxim Gekk
Date: 2018-09-17T10:12:02Z

    Added sql tests

commit 62ef168336633e014c1656ff79f17205df6a81d8
Author: Maxim Gekk
Date: 2018-09-17T11:46:49Z

    Added a signature which accepts options

commit 9d3b1a2be52094c13c8543cccf3fc9c8d177e480
Author: Maxim Gekk
Date: 2018-09-17T12:31:45Z

    Support options in PySpark