[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591474#comment-16591474 ] Yuriy Davygora commented on SPARK-25195:

I opened tickets SPARK-25225, SPARK-25226 and SPARK-25227.

> Extending from_json function
>
> Key: SPARK-25195
> URL: https://issues.apache.org/jira/browse/SPARK-25195
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core
> Affects Versions: 2.3.1
> Reporter: Yuriy Davygora
> Priority: Minor
>
> Dear Spark and PySpark maintainers,
>
> I hope that opening a JIRA issue is the correct way to request an improvement. If it's not, please forgive me and kindly instruct me on how to do it instead.
>
> At our company, we are currently rewriting a lot of old MapReduce code with Spark, and the following use case is quite frequent: some string-valued dataframe columns are JSON arrays, and we want to parse them into array-typed columns.
>
> Problem number 1: The from_json function accepts as a schema only StructType or ArrayType(StructType), but not an ArrayType of primitives. Submitting the schema in string form, like
> {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat}
> does not work either; the error message says, among other things:
> {noformat}data type mismatch: Input schema array<string> must be a struct or an array of structs.{noformat}
>
> Problem number 2: Sometimes our JSON arrays contain elements of different types. For example, we might have a JSON array like
> {noformat}["string_value", 0, true, null]{noformat}
> which is valid JSON under the schema
> {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat}
> (and, for instance, Python's json.loads has no problem parsing it), but such a schema is not recognized at all. The error message becomes quite unreadable after the words
> {noformat}ParseException: u'\nmismatched input{noformat}
>
> Here is some simple Python code to reproduce the problems (using pyspark 2.3.1 and pandas 0.23.4):
> {noformat}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.types import StringType, ArrayType
>
> spark = SparkSession.builder.appName('test').getOrCreate()
>
> data = {'id': [1, 2, 3],
>         'data': ['["string1", true, null]',
>                  '["string2", false, null]',
>                  '["string3", true, "another_string3"]']}
> pdf = pd.DataFrame.from_dict(data)
> df = spark.createDataFrame(pdf)
> df.show()
>
> # Does not work: the schema is not a struct or an array of structs
> df = df.withColumn("parsed_data", F.from_json(F.col('data'), ArrayType(StringType())))
>
> # Does not work at all: the schema string is not recognized
> df = df.withColumn("parsed_data", F.from_json(F.col('data'), '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
> {noformat}
>
> For now, we have to use a UDF which calls Python's json.loads, but this is, for obvious reasons, suboptimal. If you could extend the functionality of Spark's from_json function in the next release, this would be really helpful. Thank you in advance!
>
> ==
> UPDATE: Apparently the to_json function has the same problem: it cannot convert an array-typed column to a JSON string. It would be nice for it to support arrays as well. And, speaking of problem 2, an array column with elements of different types cannot even be created in the first place.
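For reference, the json.loads-based UDF workaround mentioned in the description might look roughly like the sketch below. This is illustrative only, not code from the ticket; the parse_json_array name is made up, and it assumes coercing every element to a string is acceptable:

{noformat}
import json

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# Hypothetical json.loads-based workaround: parse the JSON array in Python
# and coerce each element to a string, since a Spark array column must have
# a single element type.
@F.udf(returnType=ArrayType(StringType()))
def parse_json_array(s):
    if s is None:
        return None
    return [None if e is None else str(e) for e in json.loads(s)]

df = df.withColumn("parsed_data", parse_json_array(F.col("data")))
{noformat}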
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591455#comment-16591455 ] Yuriy Davygora commented on SPARK-25195:

OK, I will close this one and open separate tickets.
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590140#comment-16590140 ] Maxim Gekk commented on SPARK-25195:

This is the ticket that covers both from_json and to_json: https://issues.apache.org/jira/browse/SPARK-24391. It was closed with the PR https://github.com/apache/spark/pull/21439. It would be nice to have a separate ticket specifically for to_json.
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590098#comment-16590098 ] Yuriy Davygora commented on SPARK-25195:

By the way, apparently the to_json function has the same problem: it cannot convert an array-typed column to a JSON string. It would be nice for it to support arrays as well.
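A minimal repro of that to_json limitation on Spark 2.3.x might look like the sketch below. The column names are made up, and the struct-wrapping workaround is an assumption based on to_json's struct support, not something confirmed in this thread:

{noformat}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('test').getOrCreate()

# An array-typed column built from plain Python lists.
df = spark.createDataFrame([(1, ['a', 'b']), (2, ['c'])], ['id', 'arr'])

# Fails analysis on 2.3.x: to_json only accepts struct or
# array-of-struct input.
# df.select(F.to_json(F.col('arr'))).show()

# Wrapping the array in a struct should work: {"arr":["a","b"]}
df.select(F.to_json(F.struct(F.col('arr')))).show(truncate=False)
{noformat}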
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589850#comment-16589850 ] Maxim Gekk commented on SPARK-25195:

> 1. Does this patch also solve problem 2, as described above?
No, it doesn't.

> 2. Do you know when it will be released?
It should be in the upcoming release, 2.4.
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589807#comment-16589807 ] Yuriy Davygora commented on SPARK-25195:

[~maxgekk] Cool, that's great news! Two questions:
1. Does this patch also solve problem 2, as described above?
2. Do you know when it will be released?
[jira] [Commented] (SPARK-25195) Extending from_json function
[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589335#comment-16589335 ] Maxim Gekk commented on SPARK-25195:

> Problem number 1: The from_json function accepts as a schema only StructType or ArrayType(StructType), but not an ArrayType of primitives.

This was fixed recently: https://github.com/apache/spark/pull/21439
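For illustration, with that patch in place (shipped in Spark 2.4 per the discussion above), the first failing call from the repro should work as written. A sketch under that assumption, reusing the df from the description:

{noformat}
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# On Spark 2.4+, from_json accepts an ArrayType of primitives directly.
# Note: non-string JSON elements (true, null) may still come back as null
# depending on the parser's mode, and mixed-type arrays (problem 2) remain
# inexpressible as a Spark schema.
df = df.withColumn("parsed_data",
                   F.from_json(F.col("data"), ArrayType(StringType())))
df.select("id", "parsed_data").show(truncate=False)
{noformat}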