[ 
https://issues.apache.org/jira/browse/SPARK-46778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823371#comment-17823371
 ] 

Pablo Langa Blanco commented on SPARK-46778:
--------------------------------------------

I was looking at it and found a comment in the code that explains this 
behavior 
([https://github.com/apache/spark/blob/35bced42474e3221cf61d13a142c3c5470df1f22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L377])
 

There are some tests around the code that cover it, and I reproduced it in Hive 
3.1.3, which still maintains this behavior, so I don't know if we can change it.
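
For readers without a Spark session at hand, the flattening reported in the issue quoted below can be mimicked with a small model. This is only a sketch in plain Python, not Spark's or Hive's implementation: the helper names {{wildcard_field}} and {{array_length}} are hypothetical, only the {{$[*].field}} path shape is handled, and the rule "zero matches gives NULL, one match drops the array wrapper, several matches stay an array" is an assumption inferred from the outputs shown in the issue.

```python
import json


def wildcard_field(json_str, field):
    """Toy model of get_json_object(json_str, '$[*].<field>').

    Assumption: mirrors only the outputs reported in SPARK-46778.
    """
    try:
        root = json.loads(json_str)
    except ValueError:
        return None  # invalid JSON is modeled as NULL
    if not isinstance(root, list):
        return None  # simplification: only array roots are modeled
    matches = [e[field] for e in root if isinstance(e, dict) and field in e]
    if not matches:
        return None  # no match is modeled as NULL
    if len(matches) == 1:
        # Single match: the enclosing array is dropped (the flattening).
        return json.dumps(matches[0], separators=(",", ":"))
    return json.dumps(matches, separators=(",", ":"))


def array_length(json_str):
    """Toy model of json_array_length: NULL unless the input is a JSON array."""
    if json_str is None:
        return None
    try:
        value = json.loads(json_str)
    except ValueError:
        return None
    return len(value) if isinstance(value, list) else None


one_match = wildcard_field('[{"a":[1,2,3,4,5]},{"b":"B"}]', "a")
two_matches = wildcard_field('[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]', "a")
print(one_match, array_length(one_match))      # length of the matched value
print(two_matches, array_length(two_matches))  # number of matches
```

In this model the same call yields either the length of the matched value or the number of matches, depending on how many elements matched, which is the inconsistency the reporter describes.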

> get_json_object flattens wildcard queries that match a single value
> -------------------------------------------------------------------
>
>                 Key: SPARK-46778
>                 URL: https://issues.apache.org/jira/browse/SPARK-46778
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> I think this impacts all versions of {{get_json_object}}, but I am not 
> 100% sure.
> The unit test for 
> [$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146]
>  verifies that the output of this query is a single level JSON array, but 
> when I put the same JSON and JSON path into [http://jsonpath.com/] I get a 
> result with multiple levels of nesting. It looks like Apache Spark tries to 
> flatten lists for {{[*]}} matches when there is only a single element that 
> matches.
> {code:java}
> scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
> +--------------------------------+
> |get_json_object(jsonStr, $[*].a)|
> +--------------------------------+
> |"A"                             |
> |["A","B"]                       |
> +--------------------------------+ {code}
> But this is a problem because I no longer get a consistent schema back, 
> even if the input schema is known to be consistent. For example, if I wanted 
> to know how many elements matched this query, I could wrap it in 
> {{json_array_length}}, but that does not work in the generic case.
> {code:java}
> scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |null                                               |
> |2                                                  |
> +---------------------------------------------------+ {code}
> If the matched value might itself be a JSON array, then I would get a number, 
> but it is wrong: it is the length of the matched value, not the number of matches.
> {code:java}
> scala> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |5                                                  |
> |2                                                  |
> +---------------------------------------------------+ {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
