[jira] [Updated] (SPARK-49863) NormalizeFloatingNumbers degrades nullability of expressions with nested structs

2024-10-02 Thread Nikhil Sheoran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Sheoran updated SPARK-49863:
---
Description: 
For an AttributeReference whose dataType is a nested struct in which the inner 
struct requires normalization (i.e. contains a floating-point expression), we 
end up not correctly propagating the nullability of the expression.

For example, for an expression like:
{code:java}
namedStruct("struct", namedStruct("double", ) {code}
The dataType prior to normalization is:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), false, {})){code}
whereas post-normalization, the dataType becomes:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), true, {})) 
{code}
 

We ended up converting the `nullable` attribute of the outer "struct" field 
from `false` to `true` (the inner "double" field was already nullable).
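The degradation can be sketched with a minimal Python model (hypothetical simplified `Field` type and normalize functions, not Spark's actual classes): a rewrite that rebuilds the struct field without copying the original nullability silently widens `nullable` from `false` to `true`.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    dtype: str
    nullable: bool

def normalize_buggy(field: Field) -> Field:
    # Rebuilding the field from normalized children (as a named_struct-style
    # rewrite does) defaults the new field to nullable=True, dropping the
    # original flag.
    return Field(field.name, field.dtype, nullable=True)

def normalize_fixed(field: Field) -> Field:
    # Carry the original nullability through the rewrite.
    return Field(field.name, field.dtype, nullable=field.nullable)

outer = Field("struct", "struct<double: DOUBLE>", nullable=False)
assert normalize_buggy(outer).nullable is True   # degraded to nullable
assert normalize_fixed(outer).nullable is False  # nullability preserved
```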

  was:
For an AttributeReference where the dataType is a nested struct such that the 
internal struct requires normalization (has a floating type expression), we end 
up not correctly propagating the nullability of an expression.

For example, for an expression like:
{code:java}
namedStruct("struct", namedStruct("double", ) {code}
 

The dataType prior to normalization is:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), false, {})){code}
whereas post-normalization, the dataType becomes:

 
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), true, {})) 
{code}
Essentially, we ended up converting the `nullable` attribute of the outer 
"struct" field from `false` to `true`.


> NormalizeFloatingNumbers degrades nullability of expressions with nested 
> structs
> 
>
> Key: SPARK-49863
> URL: https://issues.apache.org/jira/browse/SPARK-49863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikhil Sheoran
>Priority: Major
>
> For an AttributeReference whose dataType is a nested struct in which the inner 
> struct requires normalization (i.e. contains a floating-point expression), we 
> end up not correctly propagating the nullability of the expression.
> For example, for an expression like:
> {code:java}
> namedStruct("struct", namedStruct("double", ) {code}
> The dataType prior to normalization is:
> {code:java}
> StructType(StructField("struct", StructType(StructField("double", DoubleType, 
> true, {})), false, {})){code}
> whereas post-normalization, the dataType becomes:
> {code:java}
> StructType(StructField("struct", StructType(StructField("double", DoubleType, 
> true, {})), true, {})) 
> {code}
>  
> We ended up converting the `nullable` attribute of the outer "struct" field 
> from `false` to `true` (the inner "double" field was already nullable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49863) NormalizeFloatingNumbers degrades nullability of expressions with nested structs

2024-10-02 Thread Nikhil Sheoran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Sheoran updated SPARK-49863:
---
Description: 
For an AttributeReference whose dataType is a nested struct in which the inner 
struct requires normalization (i.e. contains a floating-point expression), we 
end up not correctly propagating the nullability of the expression.

For example, for an expression like:
{code:java}
namedStruct("struct", namedStruct("double", ) {code}
 

The dataType prior to normalization is:
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), false, {})){code}
whereas post-normalization, the dataType becomes:

 
{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), true, {})) 
{code}
Essentially, we ended up converting the `nullable` attribute of the outer 
"struct" field from `false` to `true`.

  was:
For an AttributeReference where the dataType is a nested struct such that the 
internal struct requires normalization (has a floating type expression), we end 
up not correctly propagating the nullability of an expression.

For example, for an expression like:

```

namedStruct("struct", namedStruct("double", )

```

The dataType prior to normalization is:

```

StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), false, {}))

```

whereas post-normalization, the dataType becomes:

```

StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), true, {}))

```

 

Essentially, we ended up converting the `nullable` attribute of the outer 
"struct" field from `false` to `true`.


> NormalizeFloatingNumbers degrades nullability of expressions with nested 
> structs
> 
>
> Key: SPARK-49863
> URL: https://issues.apache.org/jira/browse/SPARK-49863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikhil Sheoran
>Priority: Major
>
> For an AttributeReference whose dataType is a nested struct in which the inner 
> struct requires normalization (i.e. contains a floating-point expression), we 
> end up not correctly propagating the nullability of the expression.
> For example, for an expression like:
> {code:java}
> namedStruct("struct", namedStruct("double", ) {code}
>  
> The dataType prior to normalization is:
> {code:java}
> StructType(StructField("struct", StructType(StructField("double", DoubleType, 
> true, {})), false, {})){code}
> whereas post-normalization, the dataType becomes:
>  
> {code:java}
> StructType(StructField("struct", StructType(StructField("double", DoubleType, 
> true, {})), true, {})) 
> {code}
> Essentially, we ended up converting the `nullable` attribute of the outer 
> "struct" field from `false` to `true`.






[jira] [Created] (SPARK-49863) NormalizeFloatingNumbers degrades nullability of expressions with nested structs

2024-10-02 Thread Nikhil Sheoran (Jira)
Nikhil Sheoran created SPARK-49863:
--

 Summary: NormalizeFloatingNumbers degrades nullability of 
expressions with nested structs
 Key: SPARK-49863
 URL: https://issues.apache.org/jira/browse/SPARK-49863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikhil Sheoran


For an AttributeReference whose dataType is a nested struct in which the inner 
struct requires normalization (i.e. contains a floating-point expression), we 
end up not correctly propagating the nullability of the expression.

For example, for an expression like:

{code:java}
namedStruct("struct", namedStruct("double", ) {code}

The dataType prior to normalization is:

{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), false, {})){code}

whereas post-normalization, the dataType becomes:

{code:java}
StructType(StructField("struct", StructType(StructField("double", DoubleType, 
true, {})), true, {}))
{code}

 

Essentially, we ended up converting the `nullable` attribute of the outer 
"struct" field from `false` to `true`.






[jira] [Updated] (SPARK-49743) OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields

2024-09-20 Thread Nikhil Sheoran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Sheoran updated SPARK-49743:
---
Description: 
The `OptimizeCsvJsonExprs` rule can potentially change the schema of the 
underlying `StructField` if the field name used to access the struct differs 
(for example, in letter case) from the field name in the underlying struct.

This surfaces as a correctness issue: instead of picking the values of the 
corresponding column, we end up returning NULL.

 

A simple example query is:
{code:java}
SELECT
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t{code}
 

 

Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for 
`A`. Since struct field access is case-insensitive, the result should have 
been `[0], [1], [2]` for both.
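The expected lookup semantics can be sketched in Python (a hypothetical toy resolver, not Spark's implementation); the optimizer rule should resolve the accessed field against the underlying struct with the session's case sensitivity, so `.A` and `.a` find the same field:

```python
def resolve_field(field_names, name, case_sensitive=False):
    # Toy resolver: return the ordinal of `name` among the struct's field
    # names, honoring case sensitivity, or None if absent.
    for ordinal, field in enumerate(field_names):
        if case_sensitive:
            matches = field == name
        else:
            matches = field.lower() == name.lower()
        if matches:
            return ordinal
    return None

schema = ["a", "b"]
assert resolve_field(schema, "a") == 0
assert resolve_field(schema, "A") == 0   # .A must find "a" too
assert resolve_field(schema, "A", case_sensitive=True) is None
```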

  was:
The `OptimizeCsvJsonExprs` rule can potentially change the schema of the 
underlying `StructField` if there are differences in the field used to access 
the struct vs the field in the underlying struct.

This surfaces as a correctness issue where instead of picking the values for 
the corresponding column we end up returning NULL.

 

A simple example query is:
{code:java}
SELECT
  from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t{code}
 

 

Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for 
`A`. Since struct field access is case-insensitive, the result should have 
been `[0], [1], [2]` for both.


> OptimizeCsvJsonExpr should not change the schema of underlying StructType in 
> GetArrayStructFields
> -
>
> Key: SPARK-49743
> URL: https://issues.apache.org/jira/browse/SPARK-49743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.2
>Reporter: Nikhil Sheoran
>Priority: Major
>  Labels: pull-request-available
>
> The `OptimizeCsvJsonExprs` rule can potentially change the schema of the 
> underlying `StructField` if the field name used to access the struct differs 
> (for example, in letter case) from the field name in the underlying struct.
> This surfaces as a correctness issue: instead of picking the values of the 
> corresponding column, we end up returning NULL.
>  
> A simple example query is:
> {code:java}
> SELECT
>   from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
>   from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
> FROM
>   range(3) as t{code}
>  
>  
> Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for 
> `A`. Since struct field access is case-insensitive, the result should have 
> been `[0], [1], [2]` for both.






[jira] [Updated] (SPARK-49743) OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields

2024-09-20 Thread Nikhil Sheoran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Sheoran updated SPARK-49743:
---
Description: 
The `OptimizeCsvJsonExprs` rule can potentially change the schema of the 
underlying `StructField` if the field name used to access the struct differs 
(for example, in letter case) from the field name in the underlying struct.

This surfaces as a correctness issue: instead of picking the values of the 
corresponding column, we end up returning NULL.

 

A simple example query is:
{code:java}
SELECT
  from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t{code}
 

 

Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for 
`A`. Since struct field access is case-insensitive, the result should have 
been `[0], [1], [2]` for both.

  was:
The `OptimizeCsvJsonExprs` rule can potentially change the schema of the 
underlying `StructField` if there are differences in the field used to access 
the struct vs the field in the underlying struct.

This surfaces as a correctness issue where instead of picking the values for 
the corresponding column we end up returning NULL.

 

A simple example query is:

```
SELECT
  from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t
```

Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for 
`A`. Since struct field access is case-insensitive, the result should have 
been `[0], [1], [2]` for both.


> OptimizeCsvJsonExpr should not change the schema of underlying StructType in 
> GetArrayStructFields
> -
>
> Key: SPARK-49743
> URL: https://issues.apache.org/jira/browse/SPARK-49743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.2
>Reporter: Nikhil Sheoran
>Priority: Major
>
> The `OptimizeCsvJsonExprs` rule can potentially change the schema of the 
> underlying `StructField` if the field name used to access the struct differs 
> (for example, in letter case) from the field name in the underlying struct.
> This surfaces as a correctness issue: instead of picking the values of the 
> corresponding column, we end up returning NULL.
>  
> A simple example query is:
> {code:java}
> SELECT
>   from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
>   from_json('[\{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
> FROM
>   range(3) as t{code}
>  
>  
> Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for 
> `A`. Since struct field access is case-insensitive, the result should have 
> been `[0], [1], [2]` for both.






[jira] [Created] (SPARK-49743) OptimizeCsvJsonExpr should not change the schema of underlying StructType in GetArrayStructFields

2024-09-20 Thread Nikhil Sheoran (Jira)
Nikhil Sheoran created SPARK-49743:
--

 Summary: OptimizeCsvJsonExpr should not change the schema of 
underlying StructType in GetArrayStructFields
 Key: SPARK-49743
 URL: https://issues.apache.org/jira/browse/SPARK-49743
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.2
Reporter: Nikhil Sheoran


The `OptimizeCsvJsonExprs` rule can potentially change the schema of the 
underlying `StructField` if the field name used to access the struct differs 
(for example, in letter case) from the field name in the underlying struct.

This surfaces as a correctness issue: instead of picking the values of the 
corresponding column, we end up returning NULL.

 

A simple example query is:

{code:java}
SELECT
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t
{code}

Here, the result is `[0], [1], [2]` for `a` but `[null], [null], [null]` for 
`A`. Since struct field access is case-insensitive, the result should have 
been `[0], [1], [2]` for both.






[jira] [Created] (SPARK-48010) Avoid repeated calls to conf.resolver in resolveExpression

2024-04-26 Thread Nikhil Sheoran (Jira)
Nikhil Sheoran created SPARK-48010:
--

 Summary: Avoid repeated calls to conf.resolver in resolveExpression
 Key: SPARK-48010
 URL: https://issues.apache.org/jira/browse/SPARK-48010
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.3
Reporter: Nikhil Sheoran


Consider a view with a large number of columns (on the order of thousands). 
When resolving such a view, a flamegraph shows repeated initializations of 
`conf` to obtain the `resolver` for each column of the view.

This can easily be optimized by obtaining the resolver once and reusing it 
across the various calls to `innerResolve` in `resolveExpression`.
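The fix amounts to hoisting the lookup out of the per-column loop. A rough Python sketch (hypothetical names; the actual code is Scala inside Spark's analyzer):

```python
call_count = 0

def get_resolver():
    # Stand-in for conf.resolver: imagine each call re-reads configuration.
    global call_count
    call_count += 1
    return lambda a, b: a.lower() == b.lower()

def resolve_columns_slow(columns, target):
    # One conf lookup per column: thousands of lookups for a wide view.
    return [c for c in columns if get_resolver()(c, target)]

def resolve_columns_fast(columns, target):
    # Obtain the resolver once and reuse it for every column.
    resolver = get_resolver()
    return [c for c in columns if resolver(c, target)]

cols = [f"col{i}" for i in range(1000)]
resolve_columns_slow(cols, "COL7")
slow_calls = call_count
call_count = 0
resolve_columns_fast(cols, "COL7")
assert slow_calls == 1000 and call_count == 1
```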






[jira] [Created] (SPARK-46763) ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes

2024-01-18 Thread Nikhil Sheoran (Jira)
Nikhil Sheoran created SPARK-46763:
--

 Summary: ReplaceDeduplicateWithAggregate fails when non-grouping 
keys have duplicate attributes
 Key: SPARK-46763
 URL: https://issues.apache.org/jira/browse/SPARK-46763
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.5.0
Reporter: Nikhil Sheoran









[jira] [Updated] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases

2024-01-10 Thread Nikhil Sheoran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Sheoran updated SPARK-46640:
---
Fix Version/s: (was: 4.0.0)

> RemoveRedundantAliases does not account for SubqueryExpression when removing 
> aliases
> 
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nikhil Sheoran
>Priority: Minor
>  Labels: pull-request-available
>
> `RemoveRedundantAliases` does not take the outer attributes of a 
> `SubqueryExpression` into account when removing aliases, potentially removing 
> them if it thinks they are redundant.
> This can cause scenarios where a subquery expression has conditions like `a#x 
> = a#x`, i.e. both the attribute names and the expression IDs are the same. 
> This can then lead to a conflicting expression IDs error.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument 
> denoting the references for which we should not remove aliases. For a query 
> with a subquery expression, adding the references of this subquery to the 
> excluded set prevents such a rewrite from happening.






[jira] [Updated] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases

2024-01-09 Thread Nikhil Sheoran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Sheoran updated SPARK-46640:
---
Summary: RemoveRedundantAliases does not account for SubqueryExpression 
when removing aliases  (was: RemoveRedundantAliases does not account for 
references in SubqueryExpression when removing aliases)

> RemoveRedundantAliases does not account for SubqueryExpression when removing 
> aliases
> 
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nikhil Sheoran
>Priority: Minor
> Fix For: 4.0.0
>
>
> `RemoveRedundantAliases` does not take the outer attributes of a 
> `SubqueryExpression` into account when removing aliases, potentially removing 
> them if it thinks they are redundant.
> This can cause scenarios where a subquery expression has conditions like `a#x 
> = a#x`, i.e. both the attribute names and the expression IDs are the same. 
> This can then lead to a conflicting expression IDs error.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument 
> denoting the references for which we should not remove aliases. For a query 
> with a subquery expression, adding the references of this subquery to the 
> excluded set prevents such a rewrite from happening.






[jira] [Created] (SPARK-46640) RemoveRedundantAliases does not account for references in SubqueryExpression when removing aliases

2024-01-09 Thread Nikhil Sheoran (Jira)
Nikhil Sheoran created SPARK-46640:
--

 Summary: RemoveRedundantAliases does not account for references in 
SubqueryExpression when removing aliases
 Key: SPARK-46640
 URL: https://issues.apache.org/jira/browse/SPARK-46640
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 4.0.0
Reporter: Nikhil Sheoran
 Fix For: 4.0.0


`RemoveRedundantAliases` does not take the outer attributes of a 
`SubqueryExpression` into account when removing aliases, potentially removing 
them if it thinks they are redundant.

This can cause scenarios where a subquery expression has conditions like `a#x = 
a#x`, i.e. both the attribute names and the expression IDs are the same. This 
can then lead to a conflicting expression IDs error.

In `RemoveRedundantAliases`, we have an excluded AttributeSet argument denoting 
the references for which we should not remove aliases. For a query with a 
subquery expression, adding the references of this subquery to the excluded set 
prevents such a rewrite from happening.
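A minimal Python sketch of the guard described above (hypothetical simplified projection model, not Spark's Scala implementation): aliases over attributes in the excluded set survive the rewrite, so attributes referenced by a subquery keep distinct expression IDs.

```python
def remove_redundant_aliases(projections, excluded):
    # projections: list of (output_name, input_attr) pairs. An alias is
    # "redundant" when it merely renames an attribute to its own name; such
    # aliases are normally collapsed to the bare attribute. Attributes in
    # `excluded` (e.g. those referenced by a SubqueryExpression) must keep
    # their aliases.
    out = []
    for name, attr in projections:
        if name == attr and attr not in excluded:
            out.append(attr)            # collapse the redundant alias
        else:
            out.append((name, attr))    # keep the alias
    return out

proj = [("a", "a"), ("b", "b")]
assert remove_redundant_aliases(proj, excluded=set()) == ["a", "b"]
# "a" is referenced by a subquery: its alias must survive.
assert remove_redundant_aliases(proj, excluded={"a"}) == [("a", "a"), "b"]
```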


