[ https://issues.apache.org/jira/browse/SPARK-46778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823371#comment-17823371 ]
Pablo Langa Blanco commented on SPARK-46778:
--------------------------------------------

I was looking at this and found a comment in the code that explains the behavior ([https://github.com/apache/spark/blob/35bced42474e3221cf61d13a142c3c5470df1f22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L377]).
There are some tests around that code that cover it. I also reproduced it in Hive 3.1.3, which still behaves the same way, so I don't know if we can change it.

> get_json_object flattens wildcard queries that match a single value
> -------------------------------------------------------------------
>
>                 Key: SPARK-46778
>                 URL: https://issues.apache.org/jira/browse/SPARK-46778
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> I think this impacts all versions of {{get_json_object}}, but I am not
> 100% sure.
> The unit test for
> [$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146]
> verifies that the output of this query is a single-level JSON array, but
> when I put the same JSON and JSON path into [http://jsonpath.com/] I get a
> result with multiple levels of nesting. It looks like Apache Spark tries to
> flatten lists for {{[*]}} matches when there is only a single element that
> matches.
> {code:java}
> scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
> +--------------------------------+
> |get_json_object(jsonStr, $[*].a)|
> +--------------------------------+
> |"A"                             |
> |["A","B"]                       |
> +--------------------------------+
> {code}
> But this is a problem because I no longer get a consistent schema returned,
> even if the input schema is known to be consistent.
> For example, if I wanted to know how many elements matched this query, I
> could wrap it in a {{json_array_length}}, but that does not work in the
> generic case.
> {code:java}
> scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |null                                               |
> |2                                                  |
> +---------------------------------------------------+
> {code}
> If the matched value happens to be a JSON array itself, then I do get a
> number, but it is wrong.
> {code:java}
> scala> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |5                                                  |
> |2                                                  |
> +---------------------------------------------------+
> {code}
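
To make the reported ambiguity concrete, the collapsing rule that the outputs above suggest can be modeled in plain Scala. This is only a simplified sketch and not Spark's code: {{WildcardCollapseModel}}, {{collapse}} and {{alwaysArray}} are hypothetical names, the inputs are assumed to be already-serialized JSON values, and representing zero matches as {{None}} (standing in for SQL NULL) is an assumption.

```scala
// Simplified model of the collapsing rule suggested by the outputs above.
// NOT Spark's implementation; all names here are hypothetical.
object WildcardCollapseModel {

  /** Combine the already-serialized JSON values matched by a `[*]` step. */
  def collapse(matches: Seq[String]): Option[String] = matches match {
    case Seq()       => None                               // no match (assumed to map to NULL)
    case Seq(single) => Some(single)                       // one match: outer array is dropped
    case many        => Some(many.mkString("[", ",", "]")) // several matches: JSON array
  }

  /** The alternative the issue asks about: always wrap matches in an array. */
  def alwaysArray(matches: Seq[String]): String =
    matches.mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    // Mirrors the three shapes from the report.
    println(collapse(Seq("\"A\"")))          // single scalar match
    println(collapse(Seq("\"A\"", "\"B\""))) // two scalar matches
    println(collapse(Seq("[1,2,3,4,5]")))    // single match that is itself an array
  }
}
```

Under this model, {{collapse(Seq("[1,2]"))}} and {{collapse(Seq("1", "2"))}} both yield {{Some("[1,2]")}}, so a caller cannot recover the number of matches from the result alone, whereas {{alwaysArray}} keeps the two cases distinct.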