[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-24 Thread Yuriy Davygora (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591474#comment-16591474 ]

Yuriy Davygora commented on SPARK-25195:


I opened tickets [SPARK-25225], [SPARK-25226], and [SPARK-25227].

> Extending from_json function
> 
>
> Key: SPARK-25195
> URL: https://issues.apache.org/jira/browse/SPARK-25195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
>   Dear Spark and PySpark maintainers,
>   I hope that opening a JIRA issue is the correct way to request an 
> improvement. If it's not, please forgive me and kindly instruct me on how to 
> do it instead.
>   At our company, we are currently rewriting a lot of old MapReduce code with 
> Spark, and the following use case is quite frequent: some string-valued 
> dataframe columns are JSON arrays, and we want to parse them into array-typed 
> columns.
>   Problem number 1: The from_json function accepts as a schema only a 
> StructType or an ArrayType(StructType), but not an ArrayType of primitives. 
> Submitting the schema in string form, like 
> {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat}
> does not work either; the error message says, among other things, 
> {noformat}data type mismatch: Input schema array<string> must be a struct or 
> an array of structs.{noformat}
>   Problem number 2: Sometimes our JSON arrays contain elements of 
> different types. For example, we might have a JSON array like 
> {noformat}["string_value", 0, true, null]{noformat} which is valid JSON under the 
> schema 
> {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat}
> (Python's json.loads, for instance, has no problem parsing it), but such a 
> schema is not recognized at all. The error message becomes quite unreadable 
> after the words {noformat}ParseException: u'\nmismatched 
> input{noformat}
>   Here is some simple Python code to reproduce the problems (using pyspark 
> 2.3.1 and pandas 0.23.4):
>   {noformat}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.types import StringType, ArrayType
> 
> spark = SparkSession.builder.appName('test').getOrCreate()
> 
> data = {'id': [1, 2, 3],
>         'data': ['["string1", true, null]',
>                  '["string2", false, null]',
>                  '["string3", true, "another_string3"]']}
> pdf = pd.DataFrame.from_dict(data)
> df = spark.createDataFrame(pdf)
> df.show()
> 
> # Problem 1: fails, because the schema is not a struct or an array of structs
> df = df.withColumn("parsed_data",
>                    F.from_json(F.col('data'), ArrayType(StringType())))
> 
> # Problem 2: the schema string is not recognized at all
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
>     '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
>   {noformat}
>   For now, we have to use a UDF, which calls Python's json.loads, but this 
> is, for obvious reasons, suboptimal (a minimal sketch of that workaround is 
> below). If you could extend the functionality of Spark's from_json function 
> in the next release, that would be really helpful. Thank you in advance!
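> A minimal sketch of the json.loads-based workaround; the UDF name and the 
> casting of mixed-type elements to strings are illustrative, not part of Spark:
> {noformat}
> import json
> import pyspark.sql.functions as F
> from pyspark.sql.types import ArrayType, StringType
> 
> # Parse the JSON array in Python and cast every element to a string, since
> # a Spark array column must have a single element type.
> parse_json_array = F.udf(
>     lambda s: None if s is None
>     else [None if x is None else str(x) for x in json.loads(s)],
>     ArrayType(StringType()))
> 
> df = df.withColumn("parsed_data", parse_json_array(F.col("data")))
> {noformat}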
> ==
> UPDATE: By the way, the to_json function apparently has the same problems: it 
> cannot convert an array-typed column to a JSON string. It would be nice for 
> it to support arrays as well. And, speaking of problem 2, an array column of 
> mixed types cannot even be created in the first place.






[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-24 Thread Yuriy Davygora (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591455#comment-16591455 ]

Yuriy Davygora commented on SPARK-25195:


OK, I will close this one and open separate tickets.




[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-23 Thread Maxim Gekk (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590140#comment-16590140 ]

Maxim Gekk commented on SPARK-25195:


This is the ticket that covers both from_json and to_json: 
https://issues.apache.org/jira/browse/SPARK-24391 . It was closed with the PR 
https://github.com/apache/spark/pull/21439 . It would be nice to have a 
separate ticket specifically for to_json.




[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-23 Thread Yuriy Davygora (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590098#comment-16590098 ]

Yuriy Davygora commented on SPARK-25195:


By the way, apparently the to_json function has the same problems: it cannot 
convert an array-typed column to a JSON string. It would be nice for it to 
support arrays as well.
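
A hypothetical interim workaround, analogous to the from_json case: serialize 
the array column with Python's json.dumps inside a UDF (the function and 
column names are illustrative):
{noformat}
import json
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# Spark passes an ArrayType column into a UDF as a Python list, so
# json.dumps can serialize it directly.
array_to_json = F.udf(
    lambda arr: None if arr is None else json.dumps(arr),
    StringType())

df = df.withColumn("data_json", array_to_json(F.col("parsed_data")))
{noformat}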




[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-23 Thread Maxim Gekk (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589850#comment-16589850 ]

Maxim Gekk commented on SPARK-25195:


> 1. Does this patch also solve problem 2, as described above?
No, it doesn't.

> 2. Do you know when it will be released?
It should be in the upcoming release 2.4.




[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-23 Thread Yuriy Davygora (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589807#comment-16589807 ]

Yuriy Davygora commented on SPARK-25195:


[~maxgekk] Cool, that's great news!

Two questions:

1. Does this patch also solve problem 2, as described above?
2. Do you know when it will be released?





[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-22 Thread Maxim Gekk (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589335#comment-16589335 ]

Maxim Gekk commented on SPARK-25195:


> Problem number 1: The from_json function accepts as a schema only StructType 
> or ArrayType(StructType), but not an ArrayType of primitives.

This was fixed recently: https://github.com/apache/spark/pull/21439
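
Assuming that PR behaves as described (expected in Spark 2.4), a call along 
these lines should then work; the exact API surface is per the linked PR:
{noformat}
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# With the fix, from_json should accept an array of primitives as the schema:
df = df.withColumn("parsed_data",
                   F.from_json(F.col("data"), ArrayType(StringType())))

# The equivalent DDL-formatted schema string should also be accepted:
df = df.withColumn("parsed_data",
                   F.from_json(F.col("data"), "array<string>"))
{noformat}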
