[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22442


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-25 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r220407196
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2328,11 +2328,14 @@ def to_json(col, options={}):
 
 @ignore_unicode_prefix
 @since(2.4)
-def schema_of_json(col):
+def schema_of_json(col, options={}):
 """
Parses a column containing a JSON string and infers its schema in DDL format.
 
:param col: string column in json format
+:param options: options to control parsing. accepts the same options as the JSON datasource
+
+.. note:: Since Spark 2.5, it accepts options to control schema inferring.
--- End diff --

Let's convert this to:

```
.. versionchanged:: 2.5
   it accepts `options` parameter to control schema inferring.
```


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-21 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r219403620
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and type inference.
--- End diff --

As far as I can see, we don't fail. A simple example: if `multiLine` is enabled, 
`lineSep` is ignored. There are other examples.


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r219297029
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and type inference.
--- End diff --

Do we currently fail if users specify an option in `DataFrameReader` that 
doesn't apply? If we don't, I wouldn't fail here.



---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-19 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r218801255
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and type inference.
--- End diff --

> Silently ignoring provided options is worse I guess ...

What about just logging an error?

Even now, some options passed to `from_json` are silently ignored, for 
example `columnNameOfCorruptRecord`, `compression`, `mode`, `samplingRatio`, 
etc.
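
A minimal sketch of the "log instead of fail" idea, in plain Python rather than Spark code. The names `INFERENCE_OPTIONS` and `drop_ignored_options` are hypothetical, and the option set is only an illustrative guess built from the options used in this PR's SQL tests; it is not the list Spark actually honours.

```python
import logging

logger = logging.getLogger("schema_of_json_sketch")

# Hypothetical set of options assumed to affect schema inference.
# Illustrative only: this is NOT Spark's actual list.
INFERENCE_OPTIONS = {"primitivesAsString", "allowNumericLeadingZeros", "prefersDecimal"}


def drop_ignored_options(options, effective=INFERENCE_OPTIONS):
    """Split options into those kept and those ignored, logging the ignored ones."""
    # Compare names case-insensitively, as a simplifying assumption.
    known = {name.lower() for name in effective}
    kept, ignored = {}, []
    for name, value in options.items():
        if name.lower() in known:
            kept[name] = value
        else:
            ignored.append(name)
    if ignored:
        # Report the problem without failing the call.
        logger.warning(
            "Options ignored during schema inference: %s", ", ".join(sorted(ignored))
        )
    return kept, sorted(ignored)
```

With this shape, a caller can hand over one shared options map unchanged, and the ignored entries are reported rather than rejected.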



---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r218778695
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and type inference.
--- End diff --

But people probably wouldn't look at a ticket. Duplicating documentation 
is not a good idea either.

Silently ignoring provided options is worse, I guess. We should probably 
throw an exception for `from_json` too in Spark 3.0.

I thought we were going to do something similar for the parse mode as well, 
where `DROPMALFORMED` throws an exception?


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-19 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r218748820
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and type inference.
--- End diff --

I don't think it is a good idea to throw an exception in this case. Let's 
look at how the function could be used:
```
from_json('json_col, schema_of_json(, options), options)
```
Forcing users to filter options before passing them to `schema_of_json` is 
inconvenient, from my point of view.


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r218642844
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql ---
@@ -56,3 +56,8 @@ select from_json('[{"a": 1}, 2]', 'array>');
 select to_json(array('1', '2', '3'));
 select to_json(array(array(1, 2, 3), array(4)));
 
+-- infer schema of json literal using options
+select schema_of_json('{"c1":1}', map('primitivesAsString', 'true'));
+select schema_of_json('{"c1":01, "c2":0.1}', map('allowNumericLeadingZeros', 'true', 'prefersDecimal', 'true'));
+
+
--- End diff --

nit: unneeded newline


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r218642815
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and type inference.
--- End diff --

How about we leave a link, whitelist the effective options, and throw an 
exception for the others?
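
A rough sketch of what such a whitelist check could look like, written in plain Python rather than against Spark's code. `EFFECTIVE_OPTIONS` and `validate_options` are hypothetical names, and the whitelist holds only the options that appear in this PR's SQL tests; the real set of effective options would need to come from the JSON datasource itself.

```python
# Hypothetical whitelist: just the options used in this PR's SQL tests.
# Illustrative only: this is NOT Spark's actual list of effective options.
EFFECTIVE_OPTIONS = {"primitivesAsString", "allowNumericLeadingZeros", "prefersDecimal"}


def validate_options(options, effective=EFFECTIVE_OPTIONS):
    """Return a copy of options, or raise if any option is not whitelisted."""
    # Compare names case-insensitively, as a simplifying assumption.
    known = {name.lower() for name in effective}
    unknown = sorted(name for name in options if name.lower() not in known)
    if unknown:
        raise ValueError(
            "Options with no effect on schema inference: " + ", ".join(unknown)
        )
    return dict(options)
```

The trade-off is that a caller who reuses a single options map for both parsing and inference would have to filter it before calling a function that validates this strictly.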


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-18 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r218545335
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and type inference.
--- End diff --

Sure, but not all options affect schema inference. @rxin Maybe it 
makes sense to list the options that I pointed out in the ticket: 
https://issues.apache.org/jira/browse/SPARK-25447 ?


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r218250393
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and type inference.
--- End diff --

Maybe you can say "refer to `DataFrameReader.json` for the list of options".


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-17 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/22442

[SPARK-25447][SQL] Support JSON options by schema_of_json()

## What changes were proposed in this pull request?

In this PR, I propose to extend the `schema_of_json()` function to 
accept JSON options, since they can affect schema inference. The purpose is to 
support the same options that `from_json` can use during schema inference.
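
To illustrate why an option can change the inferred schema, here is a toy stand-in written in plain Python. It is not Spark's inference code: `toy_schema_of_json` and `infer_ddl` are hypothetical names, the type rules are deliberately simplified, and only a `primitivesAsString`-style switch is modelled.

```python
import json


def infer_ddl(value, primitives_as_string=False):
    """Return a DDL-like type string for a parsed JSON value (toy rules only)."""
    if isinstance(value, dict):
        fields = ",".join(
            f"{name}:{infer_ddl(v, primitives_as_string)}" for name, v in value.items()
        )
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Toy rule: take the element type from the first item only.
        elem = infer_ddl(value[0], primitives_as_string) if value else "string"
        return f"array<{elem}>"
    if value is None or isinstance(value, str) or primitives_as_string:
        return "string"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    return "double"


def toy_schema_of_json(text, options=None):
    """Infer a DDL-like schema string from a JSON literal, honouring one option."""
    # Option names are matched case-insensitively, as a simplifying assumption.
    opts = {name.lower(): value for name, value in (options or {}).items()}
    as_string = str(opts.get("primitivesasstring", "false")).lower() == "true"
    return infer_ddl(json.loads(text), as_string)
```

In this toy version, `toy_schema_of_json('{"c1":1}')` gives `struct<c1:bigint>`, while passing `{"primitivesAsString": "true"}` gives `struct<c1:string>`, so the same input produces a different schema depending on the options.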

## How was this patch tested?

Added SQL, Python and Scala tests (`JsonExpressionsSuite` and 
`JsonFunctionsSuite`) that check that the JSON options are used.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 schema_of_json-options

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22442


commit 6a9ec940af1714a603d71f995201c4753b0e06c4
Author: Maxim Gekk 
Date:   2018-09-16T21:06:27Z

Accept options in schema_of_json

commit 68e1438e9bfdf9c4ab2cc68c251308f0463df4ef
Author: Maxim Gekk 
Date:   2018-09-16T21:29:59Z

Fix examples

commit 3365e086f662da859dcd74c0973c7331925c5bcd
Author: Maxim Gekk 
Date:   2018-09-17T10:12:02Z

Added sql tests

commit 62ef168336633e014c1656ff79f17205df6a81d8
Author: Maxim Gekk 
Date:   2018-09-17T11:46:49Z

Added a signature which accepts options

commit 9d3b1a2be52094c13c8543cccf3fc9c8d177e480
Author: Maxim Gekk 
Date:   2018-09-17T12:31:45Z

Support options in PySpark




---
