[jira] [Updated] (SPARK-49863) NormalizeFloatingNumbers degrades nullability of expressions with nested structs
[ https://issues.apache.org/jira/browse/SPARK-49863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikhil Sheoran updated SPARK-49863:
---
Description:
For an AttributeReference where the dataType is a nested struct such that the internal struct requires normalization (has a floating type expression), we end up not correctly propagating the nullability of an expression.

For example, for an expression like:
{code:java}
namedStruct("struct", namedStruct("double", ) {code}
The dataType prior to normalization is:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), false, {})){code}
whereas post-normalization, the dataType becomes:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), true, {})) {code}
We ended up converting the `nullable` attribute of the "struct" field from `false` to `true`.

was:
For an AttributeReference where the dataType is a nested struct such that the internal struct requires normalization (has a floating type expression), we end up not correctly propagating the nullability of an expression.

For example, for an expression like:
{code:java}
namedStruct("struct", namedStruct("double", ) {code}
The dataType prior to normalization is:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), false, {})){code}
whereas post-normalization, the dataType becomes:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), true, {})) {code}
Essentially, we ended up converting the `nullable` attribute of the "struct" field from `false` to `true`.
> NormalizeFloatingNumbers degrades nullability of expressions with nested structs
>
> Key: SPARK-49863
> URL: https://issues.apache.org/jira/browse/SPARK-49863
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikhil Sheoran
> Priority: Major
>
> For an AttributeReference where the dataType is a nested struct such that the internal struct requires normalization (has a floating type expression), we end up not correctly propagating the nullability of an expression.
> For example, for an expression like:
> {code:java}
> namedStruct("struct", namedStruct("double", ) {code}
> The dataType prior to normalization is:
> {code:java}
> StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), false, {})){code}
> whereas post-normalization, the dataType becomes:
> {code:java}
> StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), true, {}))
> {code}
>
> We ended up converting the `nullable` attribute of the "struct" field from `false` to `true`.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49863) NormalizeFloatingNumbers degrades nullability of expressions with nested structs
[ https://issues.apache.org/jira/browse/SPARK-49863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikhil Sheoran updated SPARK-49863:
---
Description:
For an AttributeReference where the dataType is a nested struct such that the internal struct requires normalization (has a floating type expression), we end up not correctly propagating the nullability of an expression.

For example, for an expression like:
{code:java}
namedStruct("struct", namedStruct("double", ) {code}
The dataType prior to normalization is:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), false, {})){code}
whereas post-normalization, the dataType becomes:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), true, {})) {code}
Essentially, we ended up converting the `nullable` attribute of the "struct" field from `false` to `true`.

was:
For an AttributeReference where the dataType is a nested struct such that the internal struct requires normalization (has a floating type expression), we end up not correctly propagating the nullability of an expression.

For example, for an expression like:
```
namedStruct("struct", namedStruct("double", )
```
The dataType prior to normalization is:
```
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), false, {}))
```
whereas post-normalization, the dataType becomes:
```
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), true, {}))
```
Essentially, we ended up converting the `nullable` attribute of the "struct" field from `false` to `true`.
> NormalizeFloatingNumbers degrades nullability of expressions with nested structs
>
> Key: SPARK-49863
> URL: https://issues.apache.org/jira/browse/SPARK-49863
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikhil Sheoran
> Priority: Major
>
> For an AttributeReference where the dataType is a nested struct such that the internal struct requires normalization (has a floating type expression), we end up not correctly propagating the nullability of an expression.
> For example, for an expression like:
> {code:java}
> namedStruct("struct", namedStruct("double", ) {code}
>
> The dataType prior to normalization is:
> {code:java}
> StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), false, {})){code}
> whereas post-normalization, the dataType becomes:
>
> {code:java}
> StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), true, {}))
> {code}
> Essentially, we ended up converting the `nullable` attribute of the "struct" field from `false` to `true`.
[jira] [Created] (SPARK-49863) NormalizeFloatingNumbers degrades nullability of expressions with nested structs
Nikhil Sheoran created SPARK-49863:
--
Summary: NormalizeFloatingNumbers degrades nullability of expressions with nested structs
Key: SPARK-49863
URL: https://issues.apache.org/jira/browse/SPARK-49863
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0
Reporter: Nikhil Sheoran

For an AttributeReference where the dataType is a nested struct such that the internal struct requires normalization (has a floating type expression), we end up not correctly propagating the nullability of an expression.

For example, for an expression like:
```
namedStruct("struct", namedStruct("double", )
```
The dataType prior to normalization is:
```
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), false, {}))
```
whereas post-normalization, the dataType becomes:
```
StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), true, {}))
```
Essentially, we ended up converting the `nullable` attribute of the "struct" field from `false` to `true`.
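To illustrate the symptom described in SPARK-49863, here is a toy Python model. It is not Spark code: `StructField` and `StructType` only mimic the shape of Spark's types, and `rebuild` with its `preserve_nullability` flag is a hypothetical stand-in for a rule that reconstructs nested structs. It shows how rebuilding an inner struct without carrying over the enclosing field's `nullable` flag turns `false` into `true`.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class DoubleType:
    """Stand-in for a floating-point leaf type."""


@dataclass(frozen=True)
class StructField:
    name: str
    data_type: "DataType"
    nullable: bool


@dataclass(frozen=True)
class StructType:
    fields: Tuple[StructField, ...]


DataType = Union[DoubleType, StructType]


def needs_normalization(dt: DataType) -> bool:
    """True if the type is, or contains, a floating-point leaf."""
    if isinstance(dt, DoubleType):
        return True
    return any(needs_normalization(f.data_type) for f in dt.fields)


def rebuild(dt: DataType, preserve_nullability: bool) -> DataType:
    """Rebuild a type field by field (hypothetical model of a rewrite).

    With preserve_nullability=False, every rebuilt nested struct field is
    marked nullable, which models the degradation reported in the issue.
    """
    if isinstance(dt, DoubleType):
        return dt
    new_fields = []
    for f in dt.fields:
        if isinstance(f.data_type, StructType) and needs_normalization(f.data_type):
            nullable = f.nullable if preserve_nullability else True
            inner = rebuild(f.data_type, preserve_nullability)
            new_fields.append(StructField(f.name, inner, nullable))
        else:
            new_fields.append(f)
    return StructType(tuple(new_fields))


# Same shape as the dataType in the report: a non-nullable "struct" field
# wrapping a nullable "double" field.
before = StructType((
    StructField("struct", StructType((StructField("double", DoubleType(), True),)), False),
))
degraded = rebuild(before, preserve_nullability=False)
preserved = rebuild(before, preserve_nullability=True)
```

In this model `degraded` differs from `before` only in the `nullable` flag of the outer "struct" field, while `preserved` is identical to `before`.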
[jira] [Updated] (SPARK-49743) OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields
[ https://issues.apache.org/jira/browse/SPARK-49743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikhil Sheoran updated SPARK-49743:
---
Description:
The `OptimizeCsvJsonExprs` rule can potentially change the schema of the underlying `StructField` if there are differences in the field used to access the struct vs the field in the underlying struct.

This surfaces as a correctness issue where instead of picking the values for the corresponding column we end up returning NULL.

A simple example query is:
{code:java}
SELECT
from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
range(3) as t{code}
Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for `A`. Since the struct field accessor is case-insensitive, the result should have been `[0], [1], [2]` for both.

was:
The `OptimizeCsvJsonExprs` rule can potentially change the schema of the underlying `StructField` if there are differences in the field used to access the struct vs the field in the underlying struct.

This surfaces as a correctness issue where instead of picking the values for the corresponding column we end up returning NULL.

A simple example query is:
{code:java}
SELECT
from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
range(3) as t{code}
Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for `A`. Since the struct field accessor is case-insensitive, the result should have been `[0], [1], [2]` for both.
> OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields
> -
>
> Key: SPARK-49743
> URL: https://issues.apache.org/jira/browse/SPARK-49743
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.2
> Reporter: Nikhil Sheoran
> Priority: Major
> Labels: pull-request-available
>
> The `OptimizeCsvJsonExprs` rule can potentially change the schema of the underlying `StructField` if there are differences in the field used to access the struct vs the field in the underlying struct.
> This surfaces as a correctness issue where instead of picking the values for the corresponding column we end up returning NULL.
>
> A simple example query is:
> {code:java}
> SELECT
> from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
> from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
> FROM
> range(3) as t{code}
>
> Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for `A`. Since the struct field accessor is case-insensitive, the result should have been `[0], [1], [2]` for both.
[jira] [Updated] (SPARK-49743) OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields
[ https://issues.apache.org/jira/browse/SPARK-49743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikhil Sheoran updated SPARK-49743:
---
Description:
The `OptimizeCsvJsonExprs` rule can potentially change the schema of the underlying `StructField` if there are differences in the field used to access the struct vs the field in the underlying struct.

This surfaces as a correctness issue where instead of picking the values for the corresponding column we end up returning NULL.

A simple example query is:
{code:java}
SELECT
from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
range(3) as t{code}
Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for `A`. Since the struct field accessor is case-insensitive, the result should have been `[0], [1], [2]` for both.

was:
The `OptimizeCsvJsonExprs` rule can potentially change the schema of the underlying `StructField` if there are differences in the field used to access the struct vs the field in the underlying struct.

This surfaces as a correctness issue where instead of picking the values for the corresponding column we end up returning NULL.

A simple example query is:
```
SELECT
from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
range(3) as t
```
Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for `A`. Since the struct field accessor is case-insensitive, the result should have been `[0], [1], [2]` for both.
> OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields
> -
>
> Key: SPARK-49743
> URL: https://issues.apache.org/jira/browse/SPARK-49743
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.2
> Reporter: Nikhil Sheoran
> Priority: Major
>
> The `OptimizeCsvJsonExprs` rule can potentially change the schema of the underlying `StructField` if there are differences in the field used to access the struct vs the field in the underlying struct.
> This surfaces as a correctness issue where instead of picking the values for the corresponding column we end up returning NULL.
>
> A simple example query is:
> {code:java}
> SELECT
> from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
> from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
> FROM
> range(3) as t{code}
>
> Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for `A`. Since the struct field accessor is case-insensitive, the result should have been `[0], [1], [2]` for both.
[jira] [Created] (SPARK-49743) OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields
Nikhil Sheoran created SPARK-49743:
--
Summary: OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields
Key: SPARK-49743
URL: https://issues.apache.org/jira/browse/SPARK-49743
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.2
Reporter: Nikhil Sheoran

The `OptimizeCsvJsonExprs` rule can potentially change the schema of the underlying `StructField` if there are differences in the field used to access the struct vs the field in the underlying struct.

This surfaces as a correctness issue where instead of picking the values for the corresponding column we end up returning NULL.

A simple example query is:
```
SELECT
from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
range(3) as t
```
Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for `A`. Since the struct field accessor is case-insensitive, the result should have been `[0], [1], [2]` for both.
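The failure mode reported in SPARK-49743 can be sketched with a toy Python model. This is an illustration only, not Spark's implementation: `parse_rows`, `resolve_field`, and the two `extract_*` helpers are hypothetical names, and the parser here matches JSON keys exactly by assumption. The point is that pruning the parse schema to the accessor's spelling (`A`) instead of the underlying field's spelling (`a`) makes the parser miss the key.

```python
import json


def parse_rows(text, field_names):
    """Parse a JSON array of objects, keeping only keys that match exactly.

    A key that is absent yields None, modelling a NULL column value.
    """
    return [{name: row.get(name) for name in field_names} for row in json.loads(text)]


def resolve_field(schema_fields, requested):
    """Case-insensitive lookup that returns the schema's own spelling."""
    matches = [f for f in schema_fields if f.lower() == requested.lower()]
    if len(matches) != 1:
        raise KeyError(requested)
    return matches[0]


def extract_pruned_by_accessor(text, requested):
    """Prune the parse schema to the accessor's spelling (models the bug)."""
    return [row[requested] for row in parse_rows(text, [requested])]


def extract_pruned_by_schema(text, schema_fields, requested):
    """Prune the parse schema to the underlying field name (models the fix)."""
    actual = resolve_field(schema_fields, requested)
    return [row[actual] for row in parse_rows(text, [actual])]


schema = ["a", "b"]
# Same inputs as the example query: one single-element array per id in range(3).
rows = ['[{"a": %d, "b": %d}]' % (i, 2 * i) for i in range(3)]

buggy = [extract_pruned_by_accessor(r, "A") for r in rows]
fixed = [extract_pruned_by_schema(r, schema, "A") for r in rows]
```

Here `buggy` holds a `None` in every row, mirroring the `[null], [null], [null]` result, while `fixed` recovers the values of `a`.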
[jira] [Created] (SPARK-48010) Avoid repeated calls to conf.resolver in resolveExpression
Nikhil Sheoran created SPARK-48010:
--
Summary: Avoid repeated calls to conf.resolver in resolveExpression
Key: SPARK-48010
URL: https://issues.apache.org/jira/browse/SPARK-48010
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.3
Reporter: Nikhil Sheoran

Consider a view with a large number of columns (~1000s). When resolving this view, the flamegraph showed repeated initializations of `conf` to obtain the `resolver` for each column of the view.

This can be easily optimized by obtaining the resolver once and reusing it for the various calls to `innerResolve` in `resolveExpression`.
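The optimization proposed in SPARK-48010 amounts to hoisting a repeated lookup out of a per-column loop. The following Python sketch shows the pattern under stated assumptions: `Conf`, its `resolver` property, and both `resolve_columns_*` functions are hypothetical stand-ins, and the lookup counter exists only to make the difference observable.

```python
class Conf:
    """Toy configuration object whose resolver lookup is counted."""

    def __init__(self, case_sensitive=False):
        self.case_sensitive = case_sensitive
        self.resolver_lookups = 0

    @property
    def resolver(self):
        # Each access models the cost of re-obtaining the resolver.
        self.resolver_lookups += 1
        if self.case_sensitive:
            return lambda a, b: a == b
        return lambda a, b: a.lower() == b.lower()


def resolve_columns_repeated(conf, requested, available):
    """Obtain the resolver again for every column (the reported pattern)."""
    resolved = []
    for name in requested:
        resolver = conf.resolver
        resolved.append(next(c for c in available if resolver(c, name)))
    return resolved


def resolve_columns_hoisted(conf, requested, available):
    """Obtain the resolver once and reuse it for every column."""
    resolver = conf.resolver
    return [next(c for c in available if resolver(c, name)) for name in requested]


available = ["col%d" % i for i in range(1000)]
requested = [c.upper() for c in available]

conf_repeated = Conf()
result_repeated = resolve_columns_repeated(conf_repeated, requested, available)

conf_hoisted = Conf()
result_hoisted = resolve_columns_hoisted(conf_hoisted, requested, available)
```

Both variants resolve the same columns. The repeated variant performs one lookup per column, the hoisted variant performs a single lookup regardless of the column count.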
[jira] [Created] (SPARK-46763) ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes
Nikhil Sheoran created SPARK-46763:
--
Summary: ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes
Key: SPARK-46763
URL: https://issues.apache.org/jira/browse/SPARK-46763
Project: Spark
Issue Type: Bug
Components: Optimizer
Affects Versions: 3.5.0
Reporter: Nikhil Sheoran
[jira] [Updated] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
[ https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikhil Sheoran updated SPARK-46640:
---
Fix Version/s: (was: 4.0.0)

> RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 4.0.0
> Reporter: Nikhil Sheoran
> Priority: Minor
> Labels: pull-request-available
>
> `RemoveRedundantAliases` does not take into account the outer attributes of a `SubqueryExpression` when removing aliases, so it may remove them if it considers them redundant.
> This can cause scenarios where a subquery expression has conditions like `a#x = a#x`, i.e. both the attribute names and the expression ID(s) are the same. This can then lead to a conflicting expression ID(s) error.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument denoting the references for which we should not remove aliases. For a query with a subquery expression, adding the references of this subquery to the excluded set prevents such a rewrite from happening.
[jira] [Updated] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
[ https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikhil Sheoran updated SPARK-46640:
---
Summary: RemoveRedundantAliases does not account for SubqueryExpression when removing aliases (was: RemoveRedundantAliases does not account for references in SubqueryExpression when removing aliases)

> RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 4.0.0
> Reporter: Nikhil Sheoran
> Priority: Minor
> Fix For: 4.0.0
>
> `RemoveRedundantAliases` does not take into account the outer attributes of a `SubqueryExpression` when removing aliases, so it may remove them if it considers them redundant.
> This can cause scenarios where a subquery expression has conditions like `a#x = a#x`, i.e. both the attribute names and the expression ID(s) are the same. This can then lead to a conflicting expression ID(s) error.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument denoting the references for which we should not remove aliases. For a query with a subquery expression, adding the references of this subquery to the excluded set prevents such a rewrite from happening.
[jira] [Created] (SPARK-46640) RemoveRedundantAliases does not account for references in SubqueryExpression when removing aliases
Nikhil Sheoran created SPARK-46640:
--
Summary: RemoveRedundantAliases does not account for references in SubqueryExpression when removing aliases
Key: SPARK-46640
URL: https://issues.apache.org/jira/browse/SPARK-46640
Project: Spark
Issue Type: Bug
Components: Optimizer
Affects Versions: 4.0.0
Reporter: Nikhil Sheoran
Fix For: 4.0.0

`RemoveRedundantAliases` does not take into account the outer attributes of a `SubqueryExpression` when removing aliases, so it may remove them if it considers them redundant.

This can cause scenarios where a subquery expression has conditions like `a#x = a#x`, i.e. both the attribute names and the expression ID(s) are the same. This can then lead to a conflicting expression ID(s) error.

In `RemoveRedundantAliases`, we have an excluded AttributeSet argument denoting the references for which we should not remove aliases. For a query with a subquery expression, adding the references of this subquery to the excluded set prevents such a rewrite from happening.
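The role of the excluded set described in SPARK-46640 can be sketched with a toy Python model. It is a simplification under stated assumptions, not the Catalyst rule: `Attr`, `Alias`, `remove_redundant_aliases`, and `rewrite` are hypothetical names, a plan is reduced to a single project list, and a subquery condition is reduced to one equality between an inner attribute and an outer reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


@dataclass(frozen=True)
class Alias:
    child: Attr
    name: str
    expr_id: int

    def to_attr(self):
        """The attribute this alias exposes to the rest of the plan."""
        return Attr(self.name, self.expr_id)


def remove_redundant_aliases(project_list, excluded):
    """Drop aliases that only re-expose an attribute under the same name.

    Returns the new project list and a mapping from each removed alias's
    attribute to its replacement. An alias is kept when either its child
    or its own attribute is in `excluded`.
    """
    kept, mapping = [], {}
    for e in project_list:
        removable = (
            isinstance(e, Alias)
            and e.name == e.child.name
            and e.child not in excluded
            and e.to_attr() not in excluded
        )
        if removable:
            mapping[e.to_attr()] = e.child
            kept.append(e.child)
        else:
            kept.append(e)
    return kept, mapping


def rewrite(condition, mapping):
    """Apply the alias-removal mapping to an equality (left, right)."""
    left, right = condition
    return (mapping.get(left, left), mapping.get(right, right))


inner_a = Attr("a", 1)                    # attribute produced inside the subquery
outer_ref = Attr("a", 2)                  # outer attribute the subquery refers to
project_list = [Alias(inner_a, "a", 2)]   # a#1 AS a#2 in the outer plan
condition = (inner_a, outer_ref)          # subquery condition: a#1 = outer(a#2)

# Without exclusions the alias is removed and the condition degenerates.
_, mapping = remove_redundant_aliases(project_list, excluded=set())
degenerate = rewrite(condition, mapping)

# With the subquery's references excluded, alias and condition survive.
kept, mapping = remove_redundant_aliases(project_list, excluded={outer_ref})
intact = rewrite(condition, mapping)
```

In the first run both sides of the condition end up as the same attribute, which corresponds to the `a#x = a#x` shape from the report. In the second run the alias stays in the project list and the condition is unchanged.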