[ 
https://issues.apache.org/jira/browse/SPARK-46071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46071:
-----------------------------------
    Labels: pull-request-available  (was: )

> TreeNode.toJSON may result in OOM when there are multiple levels of nesting 
> of expressions.
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-46071
>                 URL: https://issues.apache.org/jira/browse/SPARK-46071
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: JacobZheng
>            Priority: Major
>              Labels: pull-request-available
>
> I am encountering an OOM exception when executing the following code:
> {code:scala}
>     parser.parseExpression(sql).toJSON
> {code}
> The SQL is a deeply nested *_CaseWhen_* expression. After testing, I found 
> that the number of expressions in the JSON increases exponentially as the 
> nesting depth increases.
> Here is an example:
> sql:
> {code:sql}
> CASE WHEN(`cost` <= 275) THEN '(270-275]' 
> ELSE '----' END
> {code}
> json:
> {code:json}
> [
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.CaseWhen",
>         "num-children":3,
>         "branches":[
>             {
>                 "product-class":"scala.Tuple2",
>                 "_1":[
>                     {
>                         "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
>                         "num-children":2,
>                         "left":0,
>                         "right":1
>                     },
>                     {
>                         "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
>                         "num-children":0,
>                         "nameParts":"[cost]"
>                     },
>                     {
>                         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>                         "num-children":0,
>                         "value":"275",
>                         "dataType":"integer"
>                     }
>                 ],
>                 "_2":[
>                     {
>                         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>                         "num-children":0,
>                         "value":"(270-275]",
>                         "dataType":"string"
>                     }
>                 ]
>             }
>         ],
>         "elseValue":[
>             {
>                 "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>                 "num-children":0,
>                 "value":"----",
>                 "dataType":"string"
>             }
>         ]
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
>         "num-children":2,
>         "left":0,
>         "right":1
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
>         "num-children":0,
>         "nameParts":"[cost]"
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>         "num-children":0,
>         "value":"275",
>         "dataType":"integer"
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>         "num-children":0,
>         "value":"(270-275]",
>         "dataType":"string"
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>         "num-children":0,
>         "value":"----",
>         "dataType":"string"
>     }
> ]
> {code}
> The child nodes of the *_CaseWhen_* expression are stored twice in the JSON: 
> once inside its fields and once in the flat node list.
> When *_CaseWhen_* is nested twice, the child expressions of the inner 
> CASE WHEN are repeated 4 times, and so on.
> {code:sql}
> CASE WHEN(`cost` <= 270) THEN '(265-270]'
> ELSE 
>     CASE WHEN(`cost` <= 275) THEN '(270-275]' 
>     ELSE '----' END END
> {code}
> Nesting the *_CaseWhen_* expression n times in this way makes the number of 
> expressions in the JSON grow on the order of 2^n (the single-level example 
> above already contains 11).
> The reason is that the fields of *_CaseWhen_* cannot be converted to child 
> indexes when the *_jsonFields_* method runs, so each field embeds the full 
> JSON of its subtree.
> Perhaps simplifying the *_CaseWhen_* JSON by overriding the *_jsonFields_* 
> method is a viable fix, for example:
> {code:json}
> [
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.CaseWhen",
>         "num-children":3,
>         "branches":[
>             {
>                 "condition":0,
>                 "value":1
>             }
>         ],
>         "elseValue":2
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
>         "num-children":2,
>         "left":0,
>         "right":1
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
>         "num-children":0,
>         "nameParts":"[cost]"
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>         "num-children":0,
>         "value":"275",
>         "dataType":"integer"
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>         "num-children":0,
>         "value":"(270-275]",
>         "dataType":"string"
>     },
>     {
>         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>         "num-children":0,
>         "value":"----",
>         "dataType":"string"
>     }
> ]
> {code}
>  
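
The growth described in the report can be reproduced with a small toy model. The sketch below is not Spark code: `Expr`, `Leaf`, `Cmp`, the simplified `CaseWhen`, and the `JsonGrowth` helpers are hypothetical stand-ins that only count JSON objects. It assumes that every field of `CaseWhen` embeds the full serialization of its subtree (as in the first JSON example), and compares that with an index-based layout like the one proposed in the report.

```scala
// Toy model only: hypothetical stand-ins, not Spark's TreeNode/Expression API.
sealed trait Expr { def children: Seq[Expr] }
case class Leaf(name: String) extends Expr { val children: Seq[Expr] = Nil }
case class Cmp(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
}
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr {
  def children: Seq[Expr] = branches.flatMap { case (c, v) => Seq(c, v) } ++ elseValue
}

object JsonGrowth {
  // Pre-order list of all nodes, mirroring the flat array layout of the JSON.
  def flatten(e: Expr): Seq[Expr] = e +: e.children.flatMap(flatten)

  // Number of JSON objects when each CaseWhen field embeds its whole subtree
  // in addition to the subtree appearing in the flat node list.
  def embeddedCount(e: Expr): Long = flatten(e).map {
    case CaseWhen(branches, elseValue) =>
      1L +
        branches.map { case (c, v) => embeddedCount(c) + embeddedCount(v) }.sum +
        elseValue.map(embeddedCount).getOrElse(0L)
    case _ => 1L
  }.sum

  // Number of JSON objects when fields hold child indexes only.
  def indexedCount(e: Expr): Long = flatten(e).size.toLong

  // CASE WHEN cost <= 275 THEN '...' ELSE <nested CASE WHEN or literal> END
  def nested(depth: Int): Expr =
    if (depth == 0) Leaf("'----'")
    else CaseWhen(
      Seq((Cmp(Leaf("cost"), Leaf("275")), Leaf("'(270-275]'"))),
      Some(nested(depth - 1)))

  def main(args: Array[String]): Unit =
    for (n <- 1 to 4)
      println(s"depth=$n embedded=${embeddedCount(nested(n))} indexed=${indexedCount(nested(n))}")
}
```

In this model a single level yields 11 objects with embedded fields, matching the first JSON example, and 6 with indexes, matching the proposed one. Each further level roughly doubles the embedded count, while the indexed count grows linearly.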



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
