[jira] [Updated] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan
[ https://issues.apache.org/jira/browse/SPARK-47609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-47609:
-
Description:

This issue became apparent while bringing my PR [https://github.com/apache/spark/pull/43854] in sync with the latest master. That PR is mainly meant to collapse Projects early, in the analyzer phase itself, so that the tree size is kept at a minimum as Projects keep getting added. But as part of that work the CacheManager lookup also needed to be modified, and one of the newly added tests in master failed. Analysis of the failure shows that the CacheManager is not picking up the cached InMemoryRelation for a subplan.

This shows up in the following existing test in org.apache.spark.sql.DatasetCacheSuite:
{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)

  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  {color:#4c9aff}// After calling collect(), df1's buffer has been loaded.{color}
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()

  // Verify that df1 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)

  df.unpersist(blocking = true)

  {color:#00875a}// Verify that df1's cache has stayed the same, since df1's cache already has data{color}
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  *{color:#de350b}// This assertion is not right{color}*
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
Since df1 exists in the cache as an InMemoryRelation, and given
{quote}
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)
{quote}
df2 is derivable from the cached df1. So when val df2Limit = df2.limit(2) is created, it should utilize the cached df1.

The pull request for this is [https://github.com/apache/spark/pull/43854]
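For reference, a quick way to see what the above assertions are checking, e.g. in spark-shell after running the df/df1/df2 setup from the test (df1 cached and materialized, df unpersisted): inspect the plan that backs df2's cache entry and look for an InMemoryTableScanExec. This is only an illustrative check mirroring the test above, not part of the proposed fix.
{quote}
import org.apache.spark.sql.execution.columnar.{InMemoryRelation, InMemoryTableScanExec}

// The physical plan backing df2's cache entry after df.unpersist(blocking = true).
val df2CachedPlan = df2.limit(2).queryExecution.withCachedData.collectFirst {
  case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}

// If the CacheManager lookup recognized that df2 is derivable from the cached df1,
// this plan would contain an InMemoryTableScanExec over df1's cache; currently it does not.
val usesDf1Cache = df2CachedPlan.exists(_.exists(_.isInstanceOf[InMemoryTableScanExec]))
println(s"df2's re-cached plan reads from df1's cache: $usesDf1Cache")
{quote}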
[jira] [Updated] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan
[ https://issues.apache.org/jira/browse/SPARK-47609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-47609:
-
Description:

This issue became apparent while bringing my PR [https://github.com/apache/spark/pull/43854] in sync with the latest master. That PR is mainly meant to collapse Projects early, in the analyzer phase itself, so that the tree size is kept at a minimum as Projects keep getting added. But as part of that work the CacheManager lookup also needed to be modified, and one of the newly added tests in master failed. Analysis of the failure shows that the CacheManager is not picking up the cached InMemoryRelation for a subplan.

This shows up in the following existing test in org.apache.spark.sql.DatasetCacheSuite:
{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)

  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  {color:#4c9aff}// After calling collect(), df1's buffer has been loaded.{color}
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()

  // Verify that df1 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)

  df.unpersist(blocking = true)

  {color:#00875a}// Verify that df1's cache has stayed the same, since df1's cache already has data{color}
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  *{color:#de350b}// This assertion is not right{color}*
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
Since df1 exists in the cache as an InMemoryRelation, and given
{quote}
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)
{quote}
df2 is derivable from the cached df1. So when val df2Limit = df2.limit(2) is created, it should utilize the cached df1.
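To illustrate the kind of lookup being argued for: roughly speaking, a CacheManager-style lookup replaces subtrees of a query plan that are result-equal to a cached plan with the cached representation; the argument here is that the lookup should also recognize plans that are merely derivable from a cached plan (e.g. an extra Project/Filter on top of it). A minimal sketch of the exact-match part follows; CacheEntry and the wiring are simplified and hypothetical, not Spark's actual CacheManager API.
{quote}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical, simplified cache entry: the logical plan that was cached and the
// relation (here represented as just another LogicalPlan) that stands in for it.
case class CacheEntry(plan: LogicalPlan, cachedRepresentation: LogicalPlan)

// Replace any subtree that computes the same result as a cached plan.
// sameResult compares canonicalized plans. Handling "derivable" subplans
// (the df2-from-df1 case above) would need an extra matching step here.
def useCachedData(plan: LogicalPlan, cache: Seq[CacheEntry]): LogicalPlan =
  plan.transformDown {
    case subPlan =>
      cache.find(_.plan.sameResult(subPlan))
        .map(_.cachedRepresentation)
        .getOrElse(subPlan)
  }
{quote}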
[jira] [Created] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan
Asif created SPARK-47609:

Summary: CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan
Key: SPARK-47609
URL: https://issues.apache.org/jira/browse/SPARK-47609
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.1
Reporter: Asif

This issue became apparent while bringing my PR [https://github.com/apache/spark/pull/43854] in sync with the latest master. That PR is mainly meant to collapse Projects early, in the analyzer phase itself, so that the tree size is kept at a minimum as Projects keep getting added. But as part of that work the CacheManager lookup also needed to be modified, and one of the newly added tests in master failed. Analysis of the failure shows that the CacheManager is not picking up the cached InMemoryRelation for a subplan.

This shows up in the following existing test:
{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)

  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  {color:#4c9aff}// After calling collect(), df1's buffer has been loaded.{color}
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()

  // Verify that df1 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)

  df.unpersist(blocking = true)

  {color:#00875a}// Verify that df1's cache has stayed the same, since df1's cache already has data{color}
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  *{color:#de350b}// This assertion is not right{color}*
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
Since df1 exists in the cache as an InMemoryRelation, and given
{quote}
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)
{quote}
df2 is derivable from the cached df1. So when val df2Limit = df2.limit(2) is created, it should utilize the cached df1.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
[ https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831116#comment-17831116 ]

Asif edited comment on SPARK-26708 at 3/27/24 12:58 AM:

I believe the current caching logic is suboptimal, and accordingly the test for this bug asserts a suboptimal behaviour. The test is:
{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)

  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  // After calling collect(), df1's buffer has been loaded.
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()

  // Verify that df1 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)

  df.unpersist(blocking = true)

  // Verify that df1's cache has stayed the same, since df1's cache already has data
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
Optimal caching should have resulted in df2LimitInnerPlan actually containing an InMemoryTableScanExec corresponding to df1. The reason is that df1 was already materialized, so it rightly stays in the cache, and df2 is derivable from the cached df1 (it just needs an extra projection, but otherwise the cached df1 can serve df2).

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Wei Xue
> Priority: Blocker
> Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> When
[jira] [Comment Edited] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
[ https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831117#comment-17831117 ]

Asif edited comment on SPARK-26708 at 3/27/24 12:54 AM:

Towards that, please take a look at SPARK-45959 (https://issues.apache.org/jira/browse/SPARK-45959) and the PR associated with it. That PR primarily deals with aggressive collapse of Projects at the end of analysis, but as part of the fix it also uses an enhanced cached-plan lookup and thus results in the behaviour described above.

was (Author: ashahid7):
Towards that, please take a look at the ticket & PR: [https://issues.apache.org/jira/browse/SPARK-45959]

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Wei Xue
> Priority: Blocker
> Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> When performing non-cascading cache invalidation, {{recache}} is called on
> the other cache entries which are dependent on the cache being invalidated.
> It leads to the physical plans of those cache entries being re-compiled.
> For those cache entries, if the cache RDD has already been persisted, chances
> are there will be inconsistency between the data and the new plan. It can
> cause a correctness issue if the new plan's {{outputPartitioning}} or
> {{outputOrdering}} is different from that of the actual data, and
> meanwhile the cache is used by another query that asks for specific
> {{outputPartitioning}} or {{outputOrdering}} which happens to match the new
> plan but not the actual data.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
[ https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831116#comment-17831116 ]

Asif commented on SPARK-26708:
--

I believe the current caching logic is suboptimal, and accordingly the test for this bug asserts a suboptimal behaviour. The test is:
{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)

  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  // After calling collect(), df1's buffer has been loaded.
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()

  // Verify that df1 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)

  df.unpersist(blocking = true)

  // Verify that df1's cache has stayed the same, since df1's cache already has data
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
Optimal caching should have resulted in df2LimitInnerPlan actually containing an InMemoryTableScanExec corresponding to df1. The reason is that df1 was already materialized, so it rightly stays in the cache, and df2 is derivable from the cached df1 (it just needs an extra projection, but otherwise the cached df1 can serve df2).

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Wei Xue
> Priority: Blocker
> Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> When performing non-cascading cache invalidation, {{recache}} is called on
> the other cache entries which are dependent on the cache being invalidated.
> It leads to the physical plans of those cache entries being re-compiled.
> For those cache entries, if the cache RDD has already been persisted, chances
> are there will be inconsistency between the data and the new plan. It can
> cause a correctness issue if the new plan's {{outputPartitioning}} or
> {{outputOrdering}} is different from that of the actual data, and
> meanwhile the cache is used by another query that asks for specific
> {{outputPartitioning}} or {{outputOrdering}} which happens to match the new
> plan but not the actual data.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
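For readers less familiar with the scenario, a condensed version of the cache setup that this test exercises is below (a sketch for spark-shell; assertCacheDependency is a test-suite helper and is omitted here):
{quote}
val df  = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df.cache()
df1.cache()
df1.collect()   // df1's cache buffer is now materialized
df2.cache()     // df2's cache entry exists but has not been loaded yet

// Non-cascading invalidation: df's entry is dropped and dependent entries are re-cached.
// df1 keeps its already-loaded buffer; df2, never having been loaded, gets a freshly
// compiled physical plan, which is where the cached-data-vs-plan consistency question arises.
df.unpersist(blocking = true)
{quote}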
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-47320:
-
Description:

Datasets involving self joins behave in an inconsistent and unintuitive manner with respect to when an AnalysisException is thrown due to ambiguity and when the query works.

I found situations where swapping the join order causes a query to throw ambiguity-related exceptions while the original order passes. Some Datasets which are unambiguous from the user's perspective result in an AnalysisException being thrown.

After testing and fixing a bug, I think the issue lies in inconsistency in determining what constitutes ambiguous and what is unambiguous.

There are two ways to look at resolution regarding ambiguity:
1) ExprId of attributes: this is an unintuitive approach, as Spark users do not bother with ExprIds.
2) Column extraction from the Dataset using the df(col) API: this is the user-visible/understandable point of view, so determining ambiguity should be based on it. What is logically unambiguous from the user's perspective (assuming it is logically correct) should also be the basis on which Spark decides ambiguity.

For example:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b"))
val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a"))
{quote}
From perspective #1 the above code should throw an ambiguity exception, because the join condition and the projection of the df3 DataFrame use df1("a"), whose exprId matches both df1Joindf2 and df1.

But if we look at it from the perspective of the Dataset used to get the column, which is the intent of the user, the expectation is that df1("a") should be resolved against the Dataset df1 being joined, and not against df1Joindf2. If the user had intended "a" from df1Joindf2, they would have used df1Joindf2("a").

So in this case current Spark throws an exception, as it uses resolution based on #1.

But by the above logic the DataFrame below should also throw an ambiguity exception, yet it passes:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b"))
df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
{quote}
The difference between the two cases is that in the first case a select is present, while in the second query there is no select. So this implies that in the first case the df1("a") in the projection causes the ambiguity issue, but the same reference in the second case, used only in the join condition, is considered unambiguous.

IMHO, the ambiguity-identification criteria should be based entirely on #2, and applied consistently.

In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of the tests which are currently considered ambiguous (per criterion #1) become unambiguous under criterion #2.

There is an existing test in DataFrameSelfJoinSuite:
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in Project") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)
  // Assertion1 : existing
  assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))
  // Assertion2 : added by me
  assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
}
{quote}
Here Assertion1 passes (that is, the ambiguity exception is thrown), but Assertion2 fails (that is, no ambiguity exception is thrown). The only change is the join order. Logically both assertions are invalid, in the sense that neither should be throwing an exception, as from the user's perspective there is no ambiguity.

Also, much of this confusion arises because join conditions are attempted to be resolved against the "un-deduplicated" plan. The attempt to resolve the join condition should be made after deduplication of the join plan, which is what the PR for this bug does.
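As a side note, the usual way to sidestep the ambiguity today is the aliasing approach used elsewhere in the same test suite: alias the DataFrames and select via qualified column names. A small sketch, adapted from the quoted test:
{quote}
val df1 = spark.range(3)
val df2 = df1.filter($"id" > 0)

// Aliasing the DataFrames and using qualified column names avoids the
// ambiguous self-join detection altogether.
val aliasedDf1 = df1.alias("left")
val aliasedDf2 = df2.as("right")
aliasedDf1.join(aliasedDf2).select($"right.id").show()
{quote}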
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Labels: pull-request-available (was: ) > Datasets involving self joins behave in an inconsistent and unintuitive > manner > > > Key: SPARK-47320 > URL: https://issues.apache.org/jira/browse/SPARK-47320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > The behaviour of Datasets involving self joins behave in an unintuitive > manner in terms when AnalysisException is thrown due to ambiguity and when it > works. > Found situations where join order swapping causes query to throw Ambiguity > related exceptions which otherwise passes. Some of the Datasets which from > user perspective are un-ambiguous will result in Analysis Exception getting > thrown. > After testing and fixing a bug , I think the issue lies in inconsistency in > determining what constitutes ambiguous and what is un-ambiguous. > There are two ways to look at resolution regarding ambiguity > 1) ExprId of attributes : This is unintuitive approach as spark users do not > bother with the ExprIds > 2) Column Extraction from the Dataset using df(col) api : Which is the user > visible/understandable Point of View. So determining ambiguity should be > based on this. What is Logically unambiguous from users perspective ( > assuming its is logically correct) , should also be the basis of spark > product, to decide on un-ambiguity. > For Example: > {quote} > val df1 = Seq((1, 2)).toDF("a", "b") > val df2 = Seq((1, 2)).toDF("aa", "bb") > val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), > df2("aa"), df1("b")) > val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === > df1("a")).select(df1("a")) > {quote} > The above code from perspective #1 should throw ambiguity exception, because > the join condition and projection of df3 dataframe, has df1("a) which has > exprId which matches both df1Joindf2 and df1. > But if we look is from perspective of Dataset used to get column, which is > the intent of the user, the expectation is that df1("a) should be resolved > to Dataset df1 being joined, and not > df1Joindf2. If user intended "a" from df1Joindf2, then would have used > df1Joindf2("a") > So In this case , current spark throws Exception as it is using resolution > based on # 1 > But the below Dataframe by the above logic, should also throw Ambiguity > Exception but it passes > {quote} > val df1 = Seq((1, 2)).toDF("a", "b") > val df2 = Seq((1, 2)).toDF("aa", "bb") > val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), > df2("aa"), df1("b")) > df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) > {quote} > The difference in the 2 cases is that in the first case , select is present. > But in the 2nd query, select is not there. > So this implies that in 1st case the df1("a") in projection is causing > ambiguity issue, but same reference in 2nd case, used just in condition, is > considered un-ambiguous. > IMHO , the ambiguity identification criteria should be based totally on #2 > and consistently. > In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of > the tests which are being considered ambiguous ( on # 1 criteria) become > un-ambiguous using (#2) criteria. 
> There is an existing test in DataFrameSelfJoinSuite > {quote} > test("SPARK-28344: fail ambiguous self join - column ref in Project") > val df1 = spark.range(3) > val df2 = df1.filter($"id" > 0) > Assertion1 : existing > assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) > Assertion2 : added by me > assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) > } > {quote} > Here the Assertion1 passes ( that is ambiguous exception is thrown) > But the Assertion2 fails ( that is no ambiguous exception is thrown) > The only chnage is the join order > Logically both the assertions are invalid ( In the sense both should NOT be > throwing Exception as from the user's perspective there is no ambiguity. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Description: The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive approach as spark users do not bother with the ExprIds 2) Column Extraction from the Dataset using df(col) api : Which is the user visible/understandable Point of View. So determining ambiguity should be based on this. What is Logically unambiguous from users perspective ( assuming its is logically correct) , should also be the basis of spark product, to decide on un-ambiguity. For Example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} The above code from perspective #1 should throw ambiguity exception, because the join condition and projection of df3 dataframe, has df1("a) which has exprId which matches both df1Joindf2 and df1. But if we look is from perspective of Dataset used to get column, which is the intent of the user, the expectation is that df1("a) should be resolved to Dataset df1 being joined, and not df1Joindf2. If user intended "a" from df1Joindf2, then would have used df1Joindf2("a") So In this case , current spark throws Exception as it is using resolution based on # 1 But the below Dataframe by the above logic, should also throw Ambiguity Exception but it passes {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference in the 2 cases is that in the first case , select is present. But in the 2nd query, select is not there. So this implies that in 1st case the df1("a") in projection is causing ambiguity issue, but same reference in 2nd case, used just in condition, is considered un-ambiguous. IMHO , the ambiguity identification criteria should be based totally on #2 and consistently. In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of the tests which are being considered ambiguous ( on # 1 criteria) become un-ambiguous using (#2) criteria. 
There is an existing test in DataFrameSelfJoinSuite {quote} test("SPARK-28344: fail ambiguous self join - column ref in Project") val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) Assertion1 : existing assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) Assertion2 : added by me assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) } {quote} Here the Assertion1 passes ( that is ambiguous exception is thrown) But the Assertion2 fails ( that is no ambiguous exception is thrown) The only chnage is the join order Logically both the assertions are invalid ( In the sense both should NOT be throwing Exception as from the user's perspective there is no ambiguity. was: The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive approach as spark users do not bother with the ExprIds 2) Column Extraction from the Dataset using df(col) api : Which is the user visible/understandable Point of View. So determining ambiguity should be based on this. What is Logically unambiguous from users perspective ( assuming its is logically correct) , should also be the basis of spark product, to decide on un-ambiguity. For Example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1,
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Description: The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive approach as spark users do not bother with the ExprIds 2) Column Extraction from the Dataset using df(col) api : Which is the user visible/understandable Point of View. So determining ambiguity should be based on this. What is Logically unambiguous from users perspective ( assuming its is logically correct) , should also be the basis of spark product, to decide on un-ambiguity. For Example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} The above code from perspective #1 should throw ambiguity exception, because the join condition and projection of df3 dataframe, has df1("a) which has exprId which matches both df1Joindf2 and df1. But if we look is from perspective of Dataset used to get column, which is the intent of the user, the expectation is that df1("a) should be resolved to Dataset df1 being joined, and not df1Joindf2. If user intended "a" from df1Joindf2, then would have used df1Joindf2("a") So In this case , current spark throws Exception as it is using resolution based on # 1 But the below Dataframe by the above logic, should also throw Ambiguity Exception but it passes {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference in the 2 cases is that in the first case , select is present. But in the 2nd query, select is not there. So this implies that in 1st case the df1("a") in projection is causing ambiguity issue, but same reference in 2nd case, used just in condition, is considered un-ambiguous. IMHO , the ambiguity identification criteria should be based totally on #2 and consistently. In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of the tests which are being considered ambiguous ( on # 1 criteria) become un-ambiguous using (#2) criteria. 
for eg: {quote} test("SPARK-28344: fail ambiguous self join - column ref in join condition") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) @@ -118,29 +139,32 @@ class DataFrameSelfJoinSuite extends QueryTest with SharedSparkSession { withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) } } {quote} The above test should not have ambiguity exception thrown as df1("id") and df2("id") are un-ambiguous from perspective of Dataset There is an existing test in DataFrameSelfJoinSuite {quote} test("SPARK-28344: fail ambiguous self join - column ref in Project") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) // Assertion1 : existing assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) // Assertion2 : added by me assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) } {quote} Here the Assertion1 passes ( that is ambiguous exception is thrown) But the Assertion2 fails ( that is no ambiguous exception is thrown) The only chnage is the join order Logically both the assertions are invalid ( In the sense both should NOT be throwing Exception as from the user's perspective there is no ambiguity. was: The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Description: The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive approach as spark users do not bother with the ExprIds 2) Column Extraction from the Dataset using df(col) api : Which is the user visible/understandable Point of View. So determining ambiguity should be based on this. What is Logically unambiguous from users perspective ( assuming its is logically correct) , should also be the basis of spark product, to decide on un-ambiguity. For Example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} The above code from perspective #1 should throw ambiguity exception, because the join condition and projection of df3 dataframe, has df1("a) which has exprId which matches both df1Joindf2 and df1. But if we look is from perspective of Dataset used to get column, which is the intent of the user, the expectation is that df1("a) should be resolved to Dataset df1 being joined, and not df1Joindf2. If user intended "a" from df1Joindf2, then would have used df1Joindf2("a") So In this case , current spark throws Exception as it is using resolution based on # 1 But the below Dataframe by the above logic, should also throw Ambiguity Exception but it passes {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference in the 2 cases is that in the first case , select is present. But in the 2nd query, select is not there. So this implies that in 1st case the df1("a") in projection is causing ambiguity issue, but same reference in 2nd case, used just in condition, is considered un-ambiguous. IMHO , the ambiguity identification criteria should be based totally on #2 and consistently. In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of the tests which are being considered ambiguous ( on # 1 criteria) become un-ambiguous using (#2) criteria. 
for eg: test("SPARK-28344: fail ambiguous self join - column ref in join condition") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) @@ -118,29 +139,32 @@ class DataFrameSelfJoinSuite extends QueryTest with SharedSparkSession { withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) } } {quote} The above test should not have ambiguity exception thrown as df1("id") and df2("id") are un-ambiguous from perspective of Dataset There is an existing test in DataFrameSelfJoinSuite ``` test("SPARK-28344: fail ambiguous self join - column ref in Project") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // `df2("id")` actually points to the column of `df1`. checkAnswer(df1.join(df2).select(df2("id")), Seq(0, 0, 1, 1, 2, 2).map(Row(_))) // Alias the dataframe and use qualified column names can fix ambiguous self-join. val aliasedDf1 = df1.alias("left") val aliasedDf2 = df2.as("right") checkAnswer( aliasedDf1.join(aliasedDf2).select($"right.id"), Seq(1, 1, 1, 2, 2, 2).map(Row(_))) } withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // Assertion1 : existing assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) // Assertion2 : added by me assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) } } ``` Here the Assertion1 passes ( that is ambiguous exception is thrown) But the Assertion2 fails ( that is no ambiguous exception is thrown) The only chnage is the join order Logically both the assertions are invalid ( In the sense both should NOT be throwing Exception as from the user's perspective there is no ambiguity. was: The behaviour of
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Description: The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive approach as spark users do not bother with the ExprIds 2) Column Extraction from the Dataset using df(col) api : Which is the user visible/understandable Point of View. So determining ambiguity should be based on this. What is Logically unambiguous from users perspective ( assuming its is logically correct) , should also be the basis of spark product, to decide on un-ambiguity. For Example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} The above code from perspective #1 should throw ambiguity exception, because the join condition and projection of df3 dataframe, has df1("a) which has exprId which matches both df1Joindf2 and df1. But if we look is from perspective of Dataset used to get column, which is the intent of the user, the expectation is that df1("a) should be resolved to Dataset df1 being joined, and not df1Joindf2. If user intended "a" from df1Joindf2, then would have used df1Joindf2("a") So In this case , current spark throws Exception as it is using resolution based on # 1 But the below Dataframe by the above logic, should also throw Ambiguity Exception but it passes {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference in the 2 cases is that in the first case , select is present. But in the 2nd query, select is not there. So this implies that in 1st case the df1("a") in projection is causing ambiguity issue, but same reference in 2nd case, used just in condition, is considered un-ambiguous. IMHO , the ambiguity identification criteria should be based totally on #2 and consistently. In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of the tests which are being considered ambiguous ( on # 1 criteria) become un-ambiguous using (#2) criteria. 
for eg: {quote} test("SPARK-28344: fail ambiguous self join - column ref in join condition") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) @@ -118,29 +139,32 @@ class DataFrameSelfJoinSuite extends QueryTest with SharedSparkSession { withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) } } {quote} The above test should not have ambiguity exception thrown as df1("id") and df2("id") are un-ambiguous from perspective of Dataset There is an existing test in DataFrameSelfJoinSuite {quote} test("SPARK-28344: fail ambiguous self join - column ref in Project") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // `df2("id")` actually points to the column of `df1`. checkAnswer(df1.join(df2).select(df2("id")), Seq(0, 0, 1, 1, 2, 2).map(Row(_))) // Alias the dataframe and use qualified column names can fix ambiguous self-join. val aliasedDf1 = df1.alias("left") val aliasedDf2 = df2.as("right") checkAnswer( aliasedDf1.join(aliasedDf2).select($"right.id"), Seq(1, 1, 1, 2, 2, 2).map(Row(_))) } withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // Assertion1 : existing assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) // Assertion2 : added by me assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) } } {quote} Here the Assertion1 passes ( that is ambiguous exception is thrown) But the Assertion2 fails ( that is no ambiguous exception is thrown) The only chnage is the join order Logically both the assertions are invalid ( In the sense both should NOT be throwing Exception as from the user's perspective there is no ambiguity. was: The
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Description: The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive approach as spark users do not bother with the ExprIds 2) Column Extraction from the Dataset using df(col) api : Which is the user visible/understandable Point of View. So determining ambiguity should be based on this. What is Logically unambiguous from users perspective ( assuming its is logically correct) , should also be the basis of spark product, to decide on un-ambiguity. For Example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} The above code from perspective #1 should throw ambiguity exception, because the join condition and projection of df3 dataframe, has df1("a) which has exprId which matches both df1Joindf2 and df1. But if we look is from perspective of Dataset used to get column, which is the intent of the user, the expectation is that df1("a) should be resolved to Dataset df1 being joined, and not df1Joindf2. If user intended "a" from df1Joindf2, then would have used df1Joindf2("a") So In this case , current spark throws Exception as it is using resolution based on # 1 But the below Dataframe by the above logic, should also throw Ambiguity Exception but it passes {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference in the 2 cases is that in the first case , select is present. But in the 2nd query, select is not there. So this implies that in 1st case the df1("a") in projection is causing ambiguity issue, but same reference in 2nd case, used just in condition, is considered un-ambiguous. IMHO , the ambiguity identification criteria should be based totally on #2 and consistently. In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of the tests which are being considered ambiguous ( on # 1 criteria) become un-ambiguous using (#2) criteria. 
for eg: {quote} test("SPARK-28344: fail ambiguous self join - column ref in join condition") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) @@ -118,29 +139,32 @@ class DataFrameSelfJoinSuite extends QueryTest with SharedSparkSession { withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) } } {quote} The above test should not have ambiguity exception thrown as df1("id") and df2("id") are un-ambiguous from perspective of Dataset There is an existing test in DataFrameSelfJoinSuite ` test("SPARK-28344: fail ambiguous self join - column ref in Project") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // `df2("id")` actually points to the column of `df1`. checkAnswer(df1.join(df2).select(df2("id")), Seq(0, 0, 1, 1, 2, 2).map(Row(_))) // Alias the dataframe and use qualified column names can fix ambiguous self-join. val aliasedDf1 = df1.alias("left") val aliasedDf2 = df2.as("right") checkAnswer( aliasedDf1.join(aliasedDf2).select($"right.id"), Seq(1, 1, 1, 2, 2, 2).map(Row(_))) } withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // Assertion1 : existing assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) // Assertion2 : added by me assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) } } ` Here the Assertion1 passes ( that is ambiguous exception is thrown) But the Assertion2 fails ( that is no ambiguous exception is thrown) The only chnage is the join order Logically both the assertions are invalid ( In the sense both should NOT be throwing Exception as from the user's perspective there is no ambiguity. was: The behaviour of
[jira] [Commented] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824877#comment-17824877 ] Asif commented on SPARK-47320: -- Opened following PR [https://github.com/apache/spark/pull/45446|https://github.com/apache/spark/pull/45446] > Datasets involving self joins behave in an inconsistent and unintuitive > manner > > > Key: SPARK-47320 > URL: https://issues.apache.org/jira/browse/SPARK-47320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > > The behaviour of Datasets involving self joins behave in an unintuitive > manner in terms when AnalysisException is thrown due to ambiguity and when it > works. > Found situations where join order swapping causes query to throw Ambiguity > related exceptions which otherwise passes. Some of the Datasets which from > user perspective are un-ambiguous will result in Analysis Exception getting > thrown. > After testing and fixing a bug , I think the issue lies in inconsistency in > determining what constitutes ambiguous and what is un-ambiguous. > There are two ways to look at resolution regarding ambiguity > 1) ExprId of attributes : This is unintuitive approach as spark users do not > bother with the ExprIds > 2) Column Extraction from the Dataset using df(col) api : Which is the user > visible/understandable Point of View. So determining ambiguity should be > based on this. What is Logically unambiguous from users perspective ( > assuming its is logically correct) , should also be the basis of spark > product, to decide on un-ambiguity. > For Example: > {quote} > val df1 = Seq((1, 2)).toDF("a", "b") > val df2 = Seq((1, 2)).toDF("aa", "bb") > val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), > df2("aa"), df1("b")) > val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === > df1("a")).select(df1("a")) > {quote} > The above code from perspective #1 should throw ambiguity exception, because > the join condition and projection of df3 dataframe, has df1("a) which has > exprId which matches both df1Joindf2 and df1. > But if we look is from perspective of Dataset used to get column, which is > the intent of the user, the expectation is that df1("a) should be resolved > to Dataset df1 being joined, and not > df1Joindf2. If user intended "a" from df1Joindf2, then would have used > df1Joindf2("a") > So In this case , current spark throws Exception as it is using resolution > based on # 1 > But the below Dataframe by the above logic, should also throw Ambiguity > Exception but it passes > {quote} > val df1 = Seq((1, 2)).toDF("a", "b") > val df2 = Seq((1, 2)).toDF("aa", "bb") > val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), > df2("aa"), df1("b")) > df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) > {quote} > The difference in the 2 cases is that in the first case , select is present. > But in the 2nd query, select is not there. > So this implies that in 1st case the df1("a") in projection is causing > ambiguity issue, but same reference in 2nd case, used just in condition, is > considered un-ambiguous. > IMHO , the ambiguity identification criteria should be based totally on #2 > and consistently. > In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of > the tests which are being considered ambiguous ( on # 1 criteria) become > un-ambiguous using (#2) criteria. 
> for eg: > {quote} > test("SPARK-28344: fail ambiguous self join - column ref in join condition") { > val df1 = spark.range(3) > val df2 = df1.filter($"id" > 0) > @@ -118,29 +139,32 @@ class DataFrameSelfJoinSuite extends QueryTest > with SharedSparkSession { > withSQLConf( > SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", > SQLConf.CROSS_JOINS_ENABLED.key -> "true") { > assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) > } > } > {quote} > The above test should not have ambiguity exception thrown as df1("id") and > df2("id") are un-ambiguous from perspective of Dataset -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824589#comment-17824589 ] Asif commented on SPARK-47320: -- will be linking the bug to an open PR > Datasets involving self joins behave in an inconsistent and unintuitive > manner > > > Key: SPARK-47320 > URL: https://issues.apache.org/jira/browse/SPARK-47320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > > The behaviour of Datasets involving self joins behave in an unintuitive > manner in terms when AnalysisException is thrown due to ambiguity and when it > works. > Found situations where join order swapping causes query to throw Ambiguity > related exceptions which otherwise passes. Some of the Datasets which from > user perspective are un-ambiguous will result in Analysis Exception getting > thrown. > After testing and fixing a bug , I think the issue lies in inconsistency in > determining what constitutes ambiguous and what is un-ambiguous. > There are two ways to look at resolution regarding ambiguity > 1) ExprId of attributes : This is unintuitive approach as spark users do not > bother with the ExprIds > 2) Column Extraction from the Dataset using df(col) api : Which is the user > visible/understandable Point of View. So determining ambiguity should be > based on this. What is Logically unambiguous from users perspective ( > assuming its is logically correct) , should also be the basis of spark > product, to decide on un-ambiguity. > For Example: > {quote} > val df1 = Seq((1, 2)).toDF("a", "b") > val df2 = Seq((1, 2)).toDF("aa", "bb") > val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), > df2("aa"), df1("b")) > val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === > df1("a")).select(df1("a")) > {quote} > The above code from perspective #1 should throw ambiguity exception, because > the join condition and projection of df3 dataframe, has df1("a) which has > exprId which matches both df1Joindf2 and df1. > But if we look is from perspective of Dataset used to get column, which is > the intent of the user, the expectation is that df1("a) should be resolved > to Dataset df1 being joined, and not > df1Joindf2. If user intended "a" from df1Joindf2, then would have used > df1Joindf2("a") > So In this case , current spark throws Exception as it is using resolution > based on # 1 > But the below Dataframe by the above logic, should also throw Ambiguity > Exception but it passes > {quote} > val df1 = Seq((1, 2)).toDF("a", "b") > val df2 = Seq((1, 2)).toDF("aa", "bb") > val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), > df2("aa"), df1("b")) > df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) > {quote} > The difference in the 2 cases is that in the first case , select is present. > But in the 2nd query, select is not there. > So this implies that in 1st case the df1("a") in projection is causing > ambiguity issue, but same reference in 2nd case, used just in condition, is > considered un-ambiguous. > IMHO , the ambiguity identification criteria should be based totally on #2 > and consistently. > In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of > the tests which are being considered ambiguous ( on # 1 criteria) become > un-ambiguous using (#2) criteria. 
> for eg: > {quote} > test("SPARK-28344: fail ambiguous self join - column ref in join condition") { > val df1 = spark.range(3) > val df2 = df1.filter($"id" > 0) > @@ -118,29 +139,32 @@ class DataFrameSelfJoinSuite extends QueryTest > with SharedSparkSession { > withSQLConf( > SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", > SQLConf.CROSS_JOINS_ENABLED.key -> "true") { > assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) > } > } > {quote} > The above test should not have ambiguity exception thrown as df1("id") and > df2("id") are un-ambiguous from perspective of Dataset -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
Asif created SPARK-47320: Summary: Datasets involving self joins behave in an inconsistent and unintuitive manner Key: SPARK-47320 URL: https://issues.apache.org/jira/browse/SPARK-47320 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Asif The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive approach as spark users do not bother with the ExprIds 2) Column Extraction from the Dataset using df(col) api : Which is the user visible/understandable Point of View. So determining ambiguity should be based on this. What is Logically unambiguous from users perspective ( assuming its is logically correct) , should also be the basis of spark product, to decide on un-ambiguity. For Example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} The above code from perspective #1 should throw ambiguity exception, because the join condition and projection of df3 dataframe, has df1("a) which has exprId which matches both df1Joindf2 and df1. But if we look is from perspective of Dataset used to get column, which is the intent of the user, the expectation is that df1("a) should be resolved to Dataset df1 being joined, and not df1Joindf2. If user intended "a" from df1Joindf2, then would have used df1Joindf2("a") So In this case , current spark throws Exception as it is using resolution based on # 1 But the below Dataframe by the above logic, should also throw Ambiguity Exception but it passes {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference in the 2 cases is that in the first case , select is present. But in the 2nd query, select is not there. So this implies that in 1st case the df1("a") in projection is causing ambiguity issue, but same reference in 2nd case, used just in condition, is considered un-ambiguous. IMHO , the ambiguity identification criteria should be based totally on #2 and consistently. In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of the tests which are being considered ambiguous ( on # 1 criteria) become un-ambiguous using (#2) criteria. 
for eg: {quote} test("SPARK-28344: fail ambiguous self join - column ref in join condition") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) @@ -118,29 +139,32 @@ class DataFrameSelfJoinSuite extends QueryTest with SharedSparkSession { withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) } } {quote} The above test should not have ambiguity exception thrown as df1("id") and df2("id") are un-ambiguous from perspective of Dataset -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39441) Speed up DeduplicateRelations
[ https://issues.apache.org/jira/browse/SPARK-39441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824102#comment-17824102 ] Asif commented on SPARK-39441: -- this issue should be resolved by the PR for ticket [https://issues.apache.org/jira/browse/SPARK-45959|https://issues.apache.org/jira/browse/SPARK-45959] > Speed up DeduplicateRelations > - > > Key: SPARK-39441 > URL: https://issues.apache.org/jira/browse/SPARK-39441 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > Speed up the Analyzer rule DeduplicateRelations -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-39441) Speed up DeduplicateRelations
[ https://issues.apache.org/jira/browse/SPARK-39441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824102#comment-17824102 ] Asif edited comment on SPARK-39441 at 3/6/24 5:33 PM: -- this issue should be resolved by the PR for ticket https://issues.apache.org/jira/browse/SPARK-45959, but it is still in the open state was (Author: ashahid7): this issue should be resolved by the PR for ticket [https://issues.apache.org/jira/browse/SPARK-45959|https://issues.apache.org/jira/browse/SPARK-45959] > Speed up DeduplicateRelations > - > > Key: SPARK-39441 > URL: https://issues.apache.org/jira/browse/SPARK-39441 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > Speed up the Analyzer rule DeduplicateRelations -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823510#comment-17823510 ] Asif edited comment on SPARK-33152 at 3/5/24 6:43 PM: -- [~tedjenks] .. Unfortunately I am not a committer. As part of workday , I had opened this Jira and opened a PR to fix this issue completely which required a different logic. The changes are extensive and they were never reviewed or dicussed by OS community. This PR has been in production since past 3 years at Workday. As to why a check is not added, etc,.,: That would be unclean and as such is not easy to implement also in current codebase, because it will result in various other issues like new redundant filters being inferred and other messy bugs as the constraint code is sensitive to constraints coming from each node below and the constraints available at current node, to decide whether to create new filters or not. Constrainst are created per operator node ( project, filter etc) and arbitrary putting a limit on constraints at a given operator , will impact the new filters being created. was (Author: ashahid7): [~tedjenks] .. Unfortunately I am not a committer. As part of workday , I had opened this Jira and opened a PR to fix this issue completely which required a different logic. The changes are extensive and they were never reviewed or dicussed by OS community. This PR has been in production since past 3 years at Workday. As to why a check is not added, etc,.,: That would be unclean and as such is not easy to implement also in current codebase, because it will result in various other issues like new/wrong filters being inferred and other messy bugs as the constraint code is sensitive to constraints coming from each node below and the constraints available at current node, to decide whether to create new filters or not. Constrainst are created per operator node ( project, filter etc) and arbitrary putting a limit on constraints at a given operator , will impact the new filters being created. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases( with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue if not fixed can cause OutOfMemory issue or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by the virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on the working of some of the other unrelated previous optimizer rules is > behaving, is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in existing ConstraintPropagationSuite which is > missing a IsNotNull constraints because the code
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823512#comment-17823512 ] Asif commented on SPARK-33152: -- other than using my PR, the safe option would be to disable constraint propagation rule via sql conf. though that would mean loosing optimizations related to push down of new filters on the other side of join legs etc, > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases( with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue if not fixed can cause OutOfMemory issue or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by the virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on the working of some of the other unrelated previous optimizer rules is > behaving, is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in existing ConstraintPropagationSuite which is > missing a IsNotNull constraints because the code incorrectly generated a > EqualsNullSafeConstraint instead of EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluation all the > constraints can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. 
What problem is this proposal NOT designed to solve? > It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed/ does not > happen , respectively, by the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code, pessimistically tries to generates all > the possible combinations of constraints , based on the aliases ( even then > it may miss a lot of combinations if the expression is a complex expression > involving same attribute repeated multiple times within the expression and > there are many aliases to that column). There are query plans in our > production env, which can result in intermediate number of constraints going > into hundreds of thousands, causing OOM or taking time running into hours. > Also there are cases where it incorrectly generates an EqualNullSafe > constraint instead of EqualTo constraint , thus missing a possible IsNull > constraint on column. > Also it only pushes single column predicate on the other
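As a rough sketch of the workaround mentioned in the comment above (disabling the constraint propagation rule through a SQL conf), assuming the standard {{spark.sql.constraintPropagation.enabled}} key and a running SparkSession; as the comment notes, the trade-off is losing the optimizations the rule provides: {code}
// Sketch of the workaround: turn off constraint propagation via SQL conf.
// Trade-off: inferred IsNotNull filters and filters pushed to the other join leg are lost.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")

// ... run the queries whose compilation was slow or running out of memory ...

// Re-enable it afterwards if desired.
spark.conf.set("spark.sql.constraintPropagation.enabled", "true")
{code}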
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823510#comment-17823510 ] Asif commented on SPARK-33152: -- [~tedjenks] .. Unfortunately I am not a committer. As part of workday , I had opened this Jira and opened a PR to fix this issue completely which required a different logic. The changes are extensive and they were never reviewed or dicussed by OS community. This PR has been in production since past 3 years at Workday. As to why a check is not added, etc,.,: That would be unclean and as such is not easy to implement also in current codebase, because it will result in various other issues like new/wrong filters being inferred and other messy bugs as the constraint code is sensitive to constraints coming from each node below and the constraints available at current node, to decide whether to create new filters or not. Constrainst are created per operator node ( project, filter etc) and arbitrary putting a limit on constraints at a given operator , will impact the new filters being created. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases( with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue if not fixed can cause OutOfMemory issue or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by the virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on the working of some of the other unrelated previous optimizer rules is > behaving, is indicative of issues. 
> # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in existing ConstraintPropagationSuite which is > missing a IsNotNull constraints because the code incorrectly generated a > EqualsNullSafeConstraint instead of EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluation all the > constraints can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. What problem is this proposal NOT designed to solve? > It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed/ does not > happen , respectively, by the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code, pessimistically tries to generates all > the
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823344#comment-17823344 ] Asif commented on SPARK-33152: -- [~tedjenks] The issue has always been there because of the way constraint prop rule works ( due to it permutational logic). A possible cause why it might have become more common could be due to some changes to fix the previously undetected constraints . The more robust the code becomes in detecting the constraints, chances are it would increase the cost of over all constraints code drastically. For eg if we have a projection with multiple aliases, and say these aliases are used in case when expressions and involve functions taking these aliases, the number of constraints created would be enormous and even then the code ( atleast in 3.2) would not be able to cover all the possible constraints.. so my guess is that any changes to increase the sensitivity of constraints identification will affect the cost of the evaluation of constraints.. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases( with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue if not fixed can cause OutOfMemory issue or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by the virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on the working of some of the other unrelated previous optimizer rules is > behaving, is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. 
> # There is at least one test in existing ConstraintPropagationSuite which is > missing a IsNotNull constraints because the code incorrectly generated a > EqualsNullSafeConstraint instead of EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluation all the > constraints can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. What problem is this proposal NOT designed to solve? > It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed/ does not > happen , respectively, by the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code, pessimistically tries to generates all > the possible combinations of constraints , based on
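To make the plan shape described in the comment above concrete, here is a hypothetical sketch (column names and sizes invented, and far smaller than the production queries mentioned): a projection with several aliases of the same column that are then reused inside case-when expressions and filters, which is the kind of shape that drives up the number of candidate constraints. {code}
// Hypothetical plan shape only: aliases of the same column reused inside CASE WHEN
// expressions and filters. At real scale this is the shape that makes constraint
// enumeration expensive during optimization.
import org.apache.spark.sql.functions._
import spark.implicits._

val base = spark.range(0, 1000).toDF("c")
val aliased = base.select($"c", $"c".as("a1"), $"c".as("a2"), $"c".as("a3"))
val derived = when($"a1" > 10, $"a2" + $"a3").otherwise($"a1" * 2).as("d")
val query = aliased.select($"c", $"a1", $"a2", $"a3", derived).where($"d" > 100 && $"a2" > 5)

// Forcing optimization makes the constraint-propagation work show up in compile time.
query.queryExecution.optimizedPlan
{code}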
[jira] [Updated] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure
[ https://issues.apache.org/jira/browse/SPARK-47217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47217: - Description: In case of some flavours of nested joins involving repetition of relation, the projected columns when passed to the DataFrame.select API , as form of df.column , can result in plan resolution failure due to attribute resolution not happening. A scenario in which this happens is {noformat} Project ( dataframe A.column("col-a") ) | Join2 || Join1 DataFrame A | DataFrame ADataFrame B {noformat} In such cases, If it so happens that Join2 - right leg DataFrame A gets re-aliased due to De-Duplication of relations, and if the project uses Column definition obtained from DataFrame A, its exprId will not match the re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. was: In case of some flavours of self join queries or nested joins involving repetition of relation, the projected columns when passed to the DataFrame.select API , as form of df.column , can result in plan resolution failure due to attribute resolution not happening. A scenario in which this happens is {noformat} Project ( dataframe A.column("col-a") ) | Join2 || Join1 DataFrame A | DataFrame ADataFrame B {noformat} In such cases, If it so happens that Join2 - right leg DataFrame A gets re-aliased due to De-Duplication of relations, and if the project uses Column definition obtained from DataFrame A, its exprId will not match the re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. > De-duplication of Relations in Joins, can result in plan resolution failure > --- > > Key: SPARK-47217 > URL: https://issues.apache.org/jira/browse/SPARK-47217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: Spark-SQL > > In case of some flavours of nested joins involving repetition of relation, > the projected columns when passed to the DataFrame.select API , as form of > df.column , can result in plan resolution failure due to attribute resolution > not happening. > A scenario in which this happens is > {noformat} > > Project ( dataframe A.column("col-a") ) > | > Join2 > || >Join1 DataFrame A > | > DataFrame ADataFrame B > {noformat} > In such cases, If it so happens that Join2 - right leg DataFrame A gets > re-aliased due to De-Duplication of relations, and if the project uses Column > definition obtained from DataFrame A, its exprId will not match the > re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
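A hypothetical sketch of the plan shape in the description above (relation and column names invented; whether it actually fails depends on the DeduplicateRelations behaviour the ticket describes): {code}
// Hypothetical sketch: DataFrame A appears in both legs of the outer join, mirroring
// the tree in the description. Names are invented for illustration only.
val dfA = spark.range(0, 5).toDF("col_a")
val dfB = spark.range(0, 5).toDF("col_b")

val join1 = dfA.join(dfB, dfA("col_a") === dfB("col_b"))     // Join1: DataFrame A x DataFrame B
val join2 = join1.join(dfA, join1("col_a") === dfA("col_a")) // Join2: Join1 x DataFrame A

// If the right-leg dfA is re-aliased by de-duplication of relations, the Column captured
// from dfA may carry an exprId that no longer resolves, per the description above.
join2.select(dfA("col_a"))
{code}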
[jira] [Updated] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure
[ https://issues.apache.org/jira/browse/SPARK-47217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47217: - Description: In case of some flavours of self join queries or nested joins involving repetition of relation, the projected columns when passed to the DataFrame.select API , as form of df.column , can result in plan resolution failure due to attribute resolution not happening. A scenario in which this happens is {noformat} Project ( dataframe A.column("col-a") ) | Join2 || Join1 DataFrame A | DataFrame ADataFrame B {noformat} In such cases, If it so happens that Join2 - right leg DataFrame A gets re-aliased due to De-Duplication of relations, and if the project uses Column definition obtained from DataFrame A, its exprId will not match the re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. was: In case of some flavours of nested self join queries, the projected columns when passed to the DataFrame.select API , as form of df.column , can result in plan resolution failure due to attribute resolution not happening. A scenario in which this happens is {noformat} Project ( dataframe A.column("col-a") ) | Join2 || Join1 DataFrame A | DataFrame ADataFrame B {noformat} In such cases, If it so happens that Join2 - right leg DataFrame A gets re-aliased due to De-Duplication of relations, and if the project uses Column definition obtained from DataFrame A, its exprId will not match the re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. > De-duplication of Relations in Joins, can result in plan resolution failure > --- > > Key: SPARK-47217 > URL: https://issues.apache.org/jira/browse/SPARK-47217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: Spark-SQL > > In case of some flavours of self join queries or nested joins involving > repetition of relation, the projected columns when passed to the > DataFrame.select API , as form of df.column , can result in plan resolution > failure due to attribute resolution not happening. > A scenario in which this happens is > {noformat} > > Project ( dataframe A.column("col-a") ) > | > Join2 > || >Join1 DataFrame A > | > DataFrame ADataFrame B > {noformat} > In such cases, If it so happens that Join2 - right leg DataFrame A gets > re-aliased due to De-Duplication of relations, and if the project uses Column > definition obtained from DataFrame A, its exprId will not match the > re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure
[ https://issues.apache.org/jira/browse/SPARK-47217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47217: - Description: In case of some flavours of nested self join queries, the projected columns when passed to the DataFrame.select API , as form of df.column , can result in plan resolution failure due to attribute resolution not happening. A scenario in which this happens is {noformat} Project ( dataframe A.column("col-a") ) | Join2 || Join1 DataFrame A | DataFrame ADataFrame B {noformat} In such cases, If it so happens that Join2 - right leg DataFrame A gets re-aliased due to De-Duplication of relations, and if the project uses Column definition obtained from DataFrame A, its exprId will not match the re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. was: In case of some flavours of nested self join queries, the projected columns when passed to the DataFrame.select API , as form of df.column , can result in plan resolution failure due to attribute resolution not happening. A scenario in which this happens is Project ( dataframe A.column("col-a") ) | Join2 |DataFrame A Join1 | DataFrame ADataFrame B In such cases, If it so happens that Join2 - right leg DataFrame A gets re-aliased due to De-Duplication of relations, and if the project uses Column definition obtained from DataFrame A, its exprId will not match the re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. > De-duplication of Relations in Joins, can result in plan resolution failure > --- > > Key: SPARK-47217 > URL: https://issues.apache.org/jira/browse/SPARK-47217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: Spark-SQL > > In case of some flavours of nested self join queries, the projected columns > when passed to the DataFrame.select API , as form of df.column , can result > in plan resolution failure due to attribute resolution not happening. > A scenario in which this happens is > > {noformat} > > Project ( dataframe A.column("col-a") ) > | > Join2 > || >Join1 DataFrame A > | > DataFrame ADataFrame B > {noformat} > In such cases, If it so happens that Join2 - right leg DataFrame A gets > re-aliased due to De-Duplication of relations, and if the project uses > Column definition obtained from DataFrame A, its exprId will not match the > re-aliased Join2 - right Leg- DataFrame A , causing resolution failure. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure
Asif created SPARK-47217: Summary: De-duplication of Relations in Joins, can result in plan resolution failure Key: SPARK-47217 URL: https://issues.apache.org/jira/browse/SPARK-47217 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Asif In case of some flavours of nested self join queries, the projected columns, when passed to the DataFrame.select API in the form of df.column, can result in plan resolution failure because attribute resolution does not happen. A scenario in which this happens is:
{noformat}
        Project ( dataframe A.column("col-a") )
                        |
                      Join2
                     /     \
                 Join1     DataFrame A
                /     \
        DataFrame A   DataFrame B
{noformat}
In such cases, if the Join2 right-leg DataFrame A gets re-aliased due to de-duplication of relations, and the project uses a Column definition obtained from DataFrame A, its exprId will not match the re-aliased Join2 right-leg DataFrame A, causing resolution failure. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-46671: - Description: While bringing my old PR, which uses a different approach to the ConstraintPropagation algorithm ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync with current master, I noticed a test failure in my branch for SPARK-33152. The failing test is in InferFiltersFromConstraintsSuite: {code} test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: Infer Filters") { val x = testRelation.as("x") val y = testRelation.as("y") val z = testRelation.as("z") // Removes EqualNullSafe when constructing candidate constraints comparePlans( InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), x.select($"x.a", $"x.a".as("xa")) .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze) // Once strategy's idempotence is not broken val originalQuery = x.join(y, condition = Some($"x.a" === $"y.a")) .select($"x.a", $"x.a".as("xa")).as("xy") .join(z, condition = Some($"xy.a" === $"z.a")).analyze val correctAnswer = x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = Some($"x.a" === $"y.a")) .select($"x.a", $"x.a".as("xa")).as("xy") .join(z.where($"a".isNotNull), condition = Some($"xy.a" === $"z.a")).analyze val optimizedQuery = InferFiltersFromConstraints(originalQuery) comparePlans(optimizedQuery, correctAnswer) comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) } {code} In the above test, I believe the assertion below is not right: a redundant filter is being created. Of the two IsNotNull constraints, only one should be created: $"xa".isNotNull && $"x.a".isNotNull. Because "xa" is an alias of x("a"), only one IsNotNull constraint is needed. // Removes EqualNullSafe when constructing candidate constraints comparePlans( InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), x.select($"x.a", $"x.a".as("xa")) .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze) This is not a big issue, but it highlights the need to take another look at the ConstraintPropagation code and related code. I am filing this Jira so that the constraint code can be tightened and made more robust. 
was: while bring my old PR which uses a different approach to the ConstraintPropagation algorithm ( [SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]) in synch with current master, I noticed a test failure in my branch for SPARK-33152: The test which is failing is InferFiltersFromConstraintSuite: {code} test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: Infer Filters") { val x = testRelation.as("x") val y = testRelation.as("y") val z = testRelation.as("z") // Removes EqualNullSafe when constructing candidate constraints comparePlans( InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), x.select($"x.a", $"x.a".as("xa")) .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze) // Once strategy's idempotence is not broken val originalQuery = x.join(y, condition = Some($"x.a" === $"y.a")) .select($"x.a", $"x.a".as("xa")).as("xy") .join(z, condition = Some($"xy.a" === $"z.a")).analyze val correctAnswer = x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = Some($"x.a" === $"y.a")) .select($"x.a", $"x.a".as("xa")).as("xy") .join(z.where($"a".isNotNull), condition = Some($"xy.a" === $"z.a")).analyze val optimizedQuery = InferFiltersFromConstraints(originalQuery) comparePlans(optimizedQuery, correctAnswer) comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) } {code} In the above test, I believe the below assertion is not proper. There is a redundant filter which is getting created. Out of these two isNotNull constraints, only one should be created. $"xa".isNotNull && $"x.a".isNotNull Because presence of (xa#0 = a#0), automatically implies that is one attribute is not null, the other also has to be not null. // Removes EqualNullSafe when constructing candidate constraints comparePlans( InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), x.select($"x.a", $"x.a".as("xa")) .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze) This is not a big issue, but it highlights the
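A small sketch (not from the ticket) of how the inferred filters discussed above can be observed through the DataFrame API instead of the catalyst DSL, assuming a spark-shell session where {{spark}} is available: {code}
// Sketch: observe the IsNotNull filters InferFiltersFromConstraints adds for the
// alias pattern above (xa is an alias of a, with xa === a in the filter).
import spark.implicits._

val x = spark.range(0, 5).toDF("a")
val df = x.select($"a", $"a".as("xa")).where($"xa" === $"a")

// The optimized plan shows the inferred IsNotNull predicates.
println(df.queryExecution.optimizedPlan.numberedTreeString)
{code}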
[jira] [Reopened] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif reopened SPARK-46671: -- After further analysis , I believe , that what I said originally in the ticket is valid and that the code Does create a redundant constraint. The reason is "xa" is an alias of "a", so there should be a IsNotNull constraint on only one of the attribute and not both. > InferFiltersFromConstraint rule is creating a redundant filter > -- > > Key: SPARK-46671 > URL: https://issues.apache.org/jira/browse/SPARK-46671 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: SQL, catalyst > > while bring my old PR which uses a different approach to the > ConstraintPropagation algorithm ( > [SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]) in synch > with current master, I noticed a test failure in my branch for SPARK-33152: > The test which is failing is > InferFiltersFromConstraintSuite: > {code} > test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: > Infer Filters") { > val x = testRelation.as("x") > val y = testRelation.as("y") > val z = testRelation.as("z") > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > // Once strategy's idempotence is not broken > val originalQuery = > x.join(y, condition = Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z, condition = Some($"xy.a" === $"z.a")).analyze > val correctAnswer = > x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = > Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z.where($"a".isNotNull), condition = Some($"xy.a" === > $"z.a")).analyze > val optimizedQuery = InferFiltersFromConstraints(originalQuery) > comparePlans(optimizedQuery, correctAnswer) > comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) > } > {code} > In the above test, I believe the below assertion is not proper. > There is a redundant filter which is getting created. > Out of these two isNotNull constraints, only one should be created. > $"xa".isNotNull && $"x.a".isNotNull > Because presence of (xa#0 = a#0), automatically implies that is one > attribute is not null, the other also has to be not null. > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > This is not a big issue, but it highlights the need to take a relook at the > code of ConstraintPropagation and related code. > I am filing this jira so that constraint code can be tightened/made more > robust. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif resolved SPARK-46671. -- Resolution: Not A Bug > InferFiltersFromConstraint rule is creating a redundant filter > -- > > Key: SPARK-46671 > URL: https://issues.apache.org/jira/browse/SPARK-46671 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: SQL, catalyst > > while bring my old PR which uses a different approach to the > ConstraintPropagation algorithm ( > [SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]) in synch > with current master, I noticed a test failure in my branch for SPARK-33152: > The test which is failing is > InferFiltersFromConstraintSuite: > {code} > test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: > Infer Filters") { > val x = testRelation.as("x") > val y = testRelation.as("y") > val z = testRelation.as("z") > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > // Once strategy's idempotence is not broken > val originalQuery = > x.join(y, condition = Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z, condition = Some($"xy.a" === $"z.a")).analyze > val correctAnswer = > x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = > Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z.where($"a".isNotNull), condition = Some($"xy.a" === > $"z.a")).analyze > val optimizedQuery = InferFiltersFromConstraints(originalQuery) > comparePlans(optimizedQuery, correctAnswer) > comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) > } > {code} > In the above test, I believe the below assertion is not proper. > There is a redundant filter which is getting created. > Out of these two isNotNull constraints, only one should be created. > $"xa".isNotNull && $"x.a".isNotNull > Because presence of (xa#0 = a#0), automatically implies that is one > attribute is not null, the other also has to be not null. > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > This is not a big issue, but it highlights the need to take a relook at the > code of ConstraintPropagation and related code. > I am filing this jira so that constraint code can be tightened/made more > robust. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805434#comment-17805434 ] Asif commented on SPARK-46671: -- on further thoughts , I am wrong.. There should be 2 separate isNotNull constraints.. > InferFiltersFromConstraint rule is creating a redundant filter > -- > > Key: SPARK-46671 > URL: https://issues.apache.org/jira/browse/SPARK-46671 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: SQL, catalyst > > while bring my old PR which uses a different approach to the > ConstraintPropagation algorithm ( > [SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]) in synch > with current master, I noticed a test failure in my branch for SPARK-33152: > The test which is failing is > InferFiltersFromConstraintSuite: > {code} > test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: > Infer Filters") { > val x = testRelation.as("x") > val y = testRelation.as("y") > val z = testRelation.as("z") > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > // Once strategy's idempotence is not broken > val originalQuery = > x.join(y, condition = Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z, condition = Some($"xy.a" === $"z.a")).analyze > val correctAnswer = > x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = > Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z.where($"a".isNotNull), condition = Some($"xy.a" === > $"z.a")).analyze > val optimizedQuery = InferFiltersFromConstraints(originalQuery) > comparePlans(optimizedQuery, correctAnswer) > comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) > } > {code} > In the above test, I believe the below assertion is not proper. > There is a redundant filter which is getting created. > Out of these two isNotNull constraints, only one should be created. > $"xa".isNotNull && $"x.a".isNotNull > Because presence of (xa#0 = a#0), automatically implies that is one > attribute is not null, the other also has to be not null. > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > This is not a big issue, but it highlights the need to take a relook at the > code of ConstraintPropagation and related code. > I am filing this jira so that constraint code can be tightened/made more > robust. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805435#comment-17805435 ] Asif commented on SPARK-46671: -- so closing the ticket > InferFiltersFromConstraint rule is creating a redundant filter > -- > > Key: SPARK-46671 > URL: https://issues.apache.org/jira/browse/SPARK-46671 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: SQL, catalyst > > while bring my old PR which uses a different approach to the > ConstraintPropagation algorithm ( > [SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]) in synch > with current master, I noticed a test failure in my branch for SPARK-33152: > The test which is failing is > InferFiltersFromConstraintSuite: > {code} > test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: > Infer Filters") { > val x = testRelation.as("x") > val y = testRelation.as("y") > val z = testRelation.as("z") > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > // Once strategy's idempotence is not broken > val originalQuery = > x.join(y, condition = Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z, condition = Some($"xy.a" === $"z.a")).analyze > val correctAnswer = > x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = > Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z.where($"a".isNotNull), condition = Some($"xy.a" === > $"z.a")).analyze > val optimizedQuery = InferFiltersFromConstraints(originalQuery) > comparePlans(optimizedQuery, correctAnswer) > comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) > } > {code} > In the above test, I believe the below assertion is not proper. > There is a redundant filter which is getting created. > Out of these two isNotNull constraints, only one should be created. > $"xa".isNotNull && $"x.a".isNotNull > Because presence of (xa#0 = a#0), automatically implies that is one > attribute is not null, the other also has to be not null. > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > This is not a big issue, but it highlights the need to take a relook at the > code of ConstraintPropagation and related code. > I am filing this jira so that constraint code can be tightened/made more > robust. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
Asif created SPARK-46671: Summary: InferFiltersFromConstraint rule is creating a redundant filter Key: SPARK-46671 URL: https://issues.apache.org/jira/browse/SPARK-46671 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Asif while bring my old PR which uses a different approach to the ConstraintPropagation algorithm ( [SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]) in synch with current master, I noticed a test failure in my branch for SPARK-33152: The test which is failing is InferFiltersFromConstraintSuite: {code} test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: Infer Filters") { val x = testRelation.as("x") val y = testRelation.as("y") val z = testRelation.as("z") // Removes EqualNullSafe when constructing candidate constraints comparePlans( InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), x.select($"x.a", $"x.a".as("xa")) .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze) // Once strategy's idempotence is not broken val originalQuery = x.join(y, condition = Some($"x.a" === $"y.a")) .select($"x.a", $"x.a".as("xa")).as("xy") .join(z, condition = Some($"xy.a" === $"z.a")).analyze val correctAnswer = x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = Some($"x.a" === $"y.a")) .select($"x.a", $"x.a".as("xa")).as("xy") .join(z.where($"a".isNotNull), condition = Some($"xy.a" === $"z.a")).analyze val optimizedQuery = InferFiltersFromConstraints(originalQuery) comparePlans(optimizedQuery, correctAnswer) comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) } {code} In the above test, I believe the below assertion is not proper. There is a redundant filter which is getting created. Out of these two isNotNull constraints, only one should be created. $"xa".isNotNull && $"x.a".isNotNull Because presence of (xa#0 = a#0), automatically implies that is one attribute is not null, the other also has to be not null. // Removes EqualNullSafe when constructing candidate constraints comparePlans( InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), x.select($"x.a", $"x.a".as("xa")) .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze) This is not a big issue, but it highlights the need to take a relook at the code of ConstraintPropagation and related code. I am filing this jira so that constraint code can be tightened/made more robust. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
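As a side note for readers following the thread above: the conclusion reached (two separate IsNotNull constraints) can be illustrated with a minimal Scala sketch. The helper name inferIsNotNull below is made up for illustration and is not the actual method in Spark's constraint-propagation code, but it shows why an equality constraint legitimately yields a not-null constraint for each referenced attribute, even when one attribute is an alias of the other at a lower level of the plan.
{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, EqualTo, Expression, IsNotNull}

// An EqualTo(a, b) predicate can only evaluate to true when both sides are
// non-null, so both IsNotNull(a) and IsNotNull(b) are valid inferred constraints.
def inferIsNotNull(constraints: Set[Expression]): Set[Expression] =
  constraints.flatMap {
    case EqualTo(l: Attribute, r: Attribute) => Seq(IsNotNull(l), IsNotNull(r))
    case _ => Seq.empty
  }
{code}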
[jira] [Updated] (SPARK-45959) SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf degradation
[ https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45959: - Description: Though documentation clearly recommends to add all columns in a single shot, but in reality is difficult to expect customer to modify their code, as in spark2 the rules in analyzer were such that they did not do deep tree traversal. Moreover in Spark3 , the plans are cloned before giving to analyzer , optimizer etc which was not the case in Spark2. All these things have resulted in query time being increased from 5 min to 2 - 3 hrs. Many times the columns are added to plan via some for loop logic which just keeps adding new computation based on some rule. So, my suggestion is to Collapse the Projects early, once the analysis of the logical plan is done, but before the plan gets assigned to the field variable in QueryExecution. The PR for the above is ready for review. The major change is in the way the lookup is performed in CacheManager. I have described the logic in the PR and have added multiple tests. was: Though documentation clearly recommends to add all columns in a single shot, but in reality is difficult to expect customer to modify their code, as in spark2 the rules in analyzer were such that they did not do deep tree traversal. Moreover in Spark3 , the plans are cloned before giving to analyzer , optimizer etc which was not the case in Spark2. All these things have resulted in query time being increased from 5 min to 2 - 3 hrs. Many times the columns are added to plan via some for loop logic which just keeps adding new computation based on some rule. So, my suggestion is to do some intial check in the withColumn api, before creating a new projection, like if all the existing columns are still being projected, and the new column being added has an expression which is not depending on the output of the top node , but its child, then instead of adding a new project, the column can be added to the existing node. For starts, may be we can just handle Project node .. > SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf > degradation > - > > Key: SPARK-45959 > URL: https://issues.apache.org/jira/browse/SPARK-45959 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > > Though documentation clearly recommends to add all columns in a single shot, > but in reality is difficult to expect customer to modify their code, as in > spark2 the rules in analyzer were such that they did not do deep tree > traversal. Moreover in Spark3 , the plans are cloned before giving to > analyzer , optimizer etc which was not the case in Spark2. > All these things have resulted in query time being increased from 5 min to 2 > - 3 hrs. > Many times the columns are added to plan via some for loop logic which just > keeps adding new computation based on some rule. > So, my suggestion is to Collapse the Projects early, once the analysis of > the logical plan is done, but before the plan gets assigned to the field > variable in QueryExecution. > The PR for the above is ready for review. > The major change is in the way the lookup is performed in CacheManager. > I have described the logic in the PR and have added multiple tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
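As a rough illustration of the "collapse the Projects early" idea in the updated description above: Spark already has a CollapseProject optimizer rule that merges adjacent Project nodes, and the sketch below simply applies it to an analyzed plan. Where exactly such a call would sit inside QueryExecution, and how the CacheManager lookup changes, is part of the linked PR and is not shown here; this is only the shape of the idea.
{code}
import org.apache.spark.sql.catalyst.optimizer.CollapseProject
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Merge stacked Project nodes produced by repeated withColumn calls into a
// single Project, so later phases traverse a much smaller tree.
def collapseEarly(analyzedPlan: LogicalPlan): LogicalPlan =
  CollapseProject(analyzedPlan)
{code}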
[jira] [Updated] (SPARK-45959) SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf degradation
[ https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45959: - Summary: SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf degradation (was: Abusing DataSet.withColumn can cause huge tree with severe perf degradation) > SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf > degradation > - > > Key: SPARK-45959 > URL: https://issues.apache.org/jira/browse/SPARK-45959 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > > Though documentation clearly recommends to add all columns in a single shot, > but in reality is difficult to expect customer to modify their code, as in > spark2 the rules in analyzer were such that they did not do deep tree > traversal. Moreover in Spark3 , the plans are cloned before giving to > analyzer , optimizer etc which was not the case in Spark2. > All these things have resulted in query time being increased from 5 min to 2 > - 3 hrs. > Many times the columns are added to plan via some for loop logic which just > keeps adding new computation based on some rule. > So, my suggestion is to do some intial check in the withColumn api, before > creating a new projection, like if all the existing columns are still being > projected, and the new column being added has an expression which is not > depending on the output of the top node , but its child, then instead of > adding a new project, the column can be added to the existing node. > For starts, may be we can just handle Project node .. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45959) Abusing DataSet.withColumn can cause huge tree with severe perf degradation
[ https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45959: - Priority: Minor (was: Major) > Abusing DataSet.withColumn can cause huge tree with severe perf degradation > --- > > Key: SPARK-45959 > URL: https://issues.apache.org/jira/browse/SPARK-45959 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Minor > > Though documentation clearly recommends to add all columns in a single shot, > but in reality is difficult to expect customer to modify their code, as in > spark2 the rules in analyzer were such that they did not do deep tree > traversal. Moreover in Spark3 , the plans are cloned before giving to > analyzer , optimizer etc which was not the case in Spark2. > All these things have resulted in query time being increased from 5 min to 2 > - 3 hrs. > Many times the columns are added to plan via some for loop logic which just > keeps adding new computation based on some rule. > So, my suggestion is to do some intial check in the withColumn api, before > creating a new projection, like if all the existing columns are still being > projected, and the new column being added has an expression which is not > depending on the output of the top node , but its child, then instead of > adding a new project, the column can be added to the existing node. > For starts, may be we can just handle Project node .. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45959) Abusing DataSet.withColumn can cause huge tree with severe perf degradation
[ https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786941#comment-17786941 ] Asif commented on SPARK-45959: -- will create a PR for the same.. > Abusing DataSet.withColumn can cause huge tree with severe perf degradation > --- > > Key: SPARK-45959 > URL: https://issues.apache.org/jira/browse/SPARK-45959 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > > Though documentation clearly recommends to add all columns in a single shot, > but in reality is difficult to expect customer to modify their code, as in > spark2 the rules in analyzer were such that they did not do deep tree > traversal. Moreover in Spark3 , the plans are cloned before giving to > analyzer , optimizer etc which was not the case in Spark2. > All these things have resulted in query time being increased from 5 min to 2 > - 3 hrs. > Many times the columns are added to plan via some for loop logic which just > keeps adding new computation based on some rule. > So, my suggestion is to do some intial check in the withColumn api, before > creating a new projection, like if all the existing columns are still being > projected, and the new column being added has an expression which is not > depending on the output of the top node , but its child, then instead of > adding a new project, the column can be added to the existing node. > For starts, may be we can just handle Project node .. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45959) Abusing DataSet.withColumn can cause huge tree with severe perf degradation
Asif created SPARK-45959: Summary: Abusing DataSet.withColumn can cause huge tree with severe perf degradation Key: SPARK-45959 URL: https://issues.apache.org/jira/browse/SPARK-45959 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Asif Though the documentation clearly recommends adding all columns in a single shot, in reality it is difficult to expect customers to modify their code, since in Spark 2 the analyzer rules did not do deep tree traversal. Moreover, in Spark 3 the plans are cloned before being handed to the analyzer, optimizer etc., which was not the case in Spark 2. All of this has resulted in query time increasing from 5 min to 2 - 3 hrs. Many times the columns are added to the plan via some for-loop logic which just keeps adding new computation based on some rule. So, my suggestion is to do an initial check in the withColumn API, before creating a new projection: if all the existing columns are still being projected, and the new column's expression does not depend on the output of the top node but on its child, then instead of adding a new Project, the column can be added to the existing node. For a start, maybe we can just handle the Project node. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
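A small, hypothetical example of the pattern described above (the column names and loop bound are made up): each withColumn call wraps the plan in another Project, so adding columns in a loop yields a very deep tree, while building the columns first and adding them in a single select keeps the tree flat. Assumes a SparkSession named spark is in scope.
{code}
import org.apache.spark.sql.functions.col

// Anti-pattern: every iteration wraps the plan in one more Project node,
// so 500 iterations produce a tree roughly 500 Projects deep.
var deep = spark.range(0, 5).toDF("a")
for (i <- 1 to 500) {
  deep = deep.withColumn(s"c$i", col("a") + i)
}

// Preferred: compute all the new columns up front and add them in one shot,
// which yields a single Project over the base relation.
val newCols = (1 to 500).map(i => (col("a") + i).as(s"c$i"))
val flat = spark.range(0, 5).toDF("a").select((col("*") +: newCols): _*)
{code}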
[jira] [Commented] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode
[ https://issues.apache.org/jira/browse/SPARK-45943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786652#comment-17786652 ] Asif commented on SPARK-45943: -- thanks [~wforget] for the input. If you have a solution, please open a PR; otherwise I can give it a shot. > DataSourceV2Relation.computeStats throws IllegalStateException in test mode > --- > > Key: SPARK-45943 > URL: https://issues.apache.org/jira/browse/SPARK-45943 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > > This issue surfaces when the new unit test of PR > [SPARK-45866|https://github.com/apache/spark/pull/43824] is added -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45866) Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg )
[ https://issues.apache.org/jira/browse/SPARK-45866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45866: - Labels: pull-request-available (was: ) > Reuse of exchange in AQE does not happen when run time filters are pushed > down to the underlying Scan ( like iceberg ) > -- > > Key: SPARK-45866 > URL: https://issues.apache.org/jira/browse/SPARK-45866 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > In certain types of queries for eg TPCDS Query 14b, the reuse of exchange > does not happen in AQE , resulting in perf degradation. > The spark TPCDS tests are unable to catch the problem, because the > InMemoryScan used for testing do not implement the equals & hashCode > correctly , in the sense , that they do take into account the pushed down run > time filters. > In concrete Scan implementations, for eg iceberg's SparkBatchQueryScan , the > equality check , apart from other things, also involves Runtime Filters > pushed ( which is correct). > In spark the issue is this: > For a given stage being materialized, just before materialization starts, > the run time filters are confined to the BatchScanExec level. > Only when the actual RDD corresponding to the BatchScanExec, is being > evaluated, do the runtime filters get pushed to the underlying Scan. > Now if a new stage is created and it checks in the stageCache using its > canonicalized plan to see if a stage can be reused, it fails to find the > r-usable stage even if the stage exists, because the canonicalized spark > plan present in the stage cache, has now the run time filters pushed to the > Scan , so the incoming canonicalized spark plan does not match the key as > their underlying scans differ . that is incoming spark plan's scan does not > have runtime filters , while the canonicalized spark plan present as key in > the stage cache has the scan with runtime filters pushed. > The fix as I have worked is to provide, two methods in the > SupportsRuntimeV2Filtering interface , > default boolean equalToIgnoreRuntimeFilters(Scan other) { > return this.equals(other); > } > default int hashCodeIgnoreRuntimeFilters() { > return this.hashCode(); > } > In the BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then > instead of batch.equals, it should call scan.equalToIgnoreRuntimeFilters > And the underlying Scan implementations should provide equality which > excludes run time filters. > Similarly the hashCode of BatchScanExec, should use > scan.hashCodeIgnoreRuntimeFilters instead of ( batch.hashCode). > Will be creating a PR with bug test for review. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
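A hedged Scala sketch of the proposal above. SupportsRuntimeV2Filtering is the existing DataSource V2 interface, but equalToIgnoreRuntimeFilters and hashCodeIgnoreRuntimeFilters are the additions this ticket proposes, so they are shown here as a separate illustrative trait together with helper functions standing in for the BatchScanExec-side change; none of the names below are shipped Spark API.
{code}
import org.apache.spark.sql.connector.read.Scan

// Filter-agnostic equality a concrete Scan (e.g. iceberg's SparkBatchQueryScan)
// would override to compare everything except the pushed runtime filters.
trait ScanEqualityIgnoringRuntimeFilters { self: Scan =>
  def equalToIgnoreRuntimeFilters(other: Scan): Boolean = self == other
  def hashCodeIgnoreRuntimeFilters(): Int = self.hashCode()
}

// What the BatchScanExec-side comparison would do: prefer the filter-agnostic
// methods when the scan opts in, fall back to plain equality otherwise. With
// this in place, a stage whose scan already has runtime filters pushed can
// still match the canonicalized plan of an incoming stage in the AQE stage cache.
def scanEquals(s1: Scan, s2: Scan): Boolean = s1 match {
  case f: ScanEqualityIgnoringRuntimeFilters => f.equalToIgnoreRuntimeFilters(s2)
  case _ => s1 == s2
}

def scanHashCode(s: Scan): Int = s match {
  case f: ScanEqualityIgnoringRuntimeFilters => f.hashCodeIgnoreRuntimeFilters()
  case _ => s.hashCode()
}
{code}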
[jira] [Created] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode
Asif created SPARK-45943: Summary: DataSourceV2Relation.computeStats throws IllegalStateException in test mode Key: SPARK-45943 URL: https://issues.apache.org/jira/browse/SPARK-45943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Asif This issue surfaces when the new unit test of PR [SPARK-45866|https://github.com/apache/spark/pull/43824] is added -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif closed SPARK-45924. this is not a bug > Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not > equivalent with SubqueryBroadcastExec > > > Key: SPARK-45924 > URL: https://issues.apache.org/jira/browse/SPARK-45924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > while writing bug test for > [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866], > found that canonicalization of SubqueryAdaptiveBroadcastExec is broken in > the sense that buildPlan : LogicalPlan is not canonicalized which causes > batchscans to differ when reuse of exchange check happens in AQE. > Moreover the equivalence of SubqueryAdaptiveBroadcastExec and > SubqueryBroadcastExec is not there which also aggravates the re-use of > exchange in aqe broken. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-45925) SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE
[ https://issues.apache.org/jira/browse/SPARK-45925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif closed SPARK-45925. this is not an issue > SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec > causing re-use of exchange not happening in AQE > -- > > Key: SPARK-45925 > URL: https://issues.apache.org/jira/browse/SPARK-45925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > A created stage may contain SubqueryAdaptiveBroadcastExec while incominng > exchange may contain SubqueryBroadcastExec and though they are equivalent , > the match does not happen because equals/hashCode do not match , resulting in > non re-use of exchange. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif resolved SPARK-45924. -- Resolution: Not A Bug > Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not > equivalent with SubqueryBroadcastExec > > > Key: SPARK-45924 > URL: https://issues.apache.org/jira/browse/SPARK-45924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > while writing bug test for > [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866], > found that canonicalization of SubqueryAdaptiveBroadcastExec is broken in > the sense that buildPlan : LogicalPlan is not canonicalized which causes > batchscans to differ when reuse of exchange check happens in AQE. > Moreover the equivalence of SubqueryAdaptiveBroadcastExec and > SubqueryBroadcastExec is not there which also aggravates the re-use of > exchange in aqe broken. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45925) SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE
[ https://issues.apache.org/jira/browse/SPARK-45925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif resolved SPARK-45925. -- Resolution: Not A Problem > SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec > causing re-use of exchange not happening in AQE > -- > > Key: SPARK-45925 > URL: https://issues.apache.org/jira/browse/SPARK-45925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > A created stage may contain SubqueryAdaptiveBroadcastExec while incominng > exchange may contain SubqueryBroadcastExec and though they are equivalent , > the match does not happen because equals/hashCode do not match , resulting in > non re-use of exchange. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45866) Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg )
[ https://issues.apache.org/jira/browse/SPARK-45866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786155#comment-17786155 ] Asif commented on SPARK-45866: -- Now that the other PRs on which this ticket itself is dependent are created, I will open a PR with bug test tomorrow. Ofcourse the bugtest itself will fail till the master contains all the dependent PRs > Reuse of exchange in AQE does not happen when run time filters are pushed > down to the underlying Scan ( like iceberg ) > -- > > Key: SPARK-45866 > URL: https://issues.apache.org/jira/browse/SPARK-45866 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > > In certain types of queries for eg TPCDS Query 14b, the reuse of exchange > does not happen in AQE , resulting in perf degradation. > The spark TPCDS tests are unable to catch the problem, because the > InMemoryScan used for testing do not implement the equals & hashCode > correctly , in the sense , that they do take into account the pushed down run > time filters. > In concrete Scan implementations, for eg iceberg's SparkBatchQueryScan , the > equality check , apart from other things, also involves Runtime Filters > pushed ( which is correct). > In spark the issue is this: > For a given stage being materialized, just before materialization starts, > the run time filters are confined to the BatchScanExec level. > Only when the actual RDD corresponding to the BatchScanExec, is being > evaluated, do the runtime filters get pushed to the underlying Scan. > Now if a new stage is created and it checks in the stageCache using its > canonicalized plan to see if a stage can be reused, it fails to find the > r-usable stage even if the stage exists, because the canonicalized spark > plan present in the stage cache, has now the run time filters pushed to the > Scan , so the incoming canonicalized spark plan does not match the key as > their underlying scans differ . that is incoming spark plan's scan does not > have runtime filters , while the canonicalized spark plan present as key in > the stage cache has the scan with runtime filters pushed. > The fix as I have worked is to provide, two methods in the > SupportsRuntimeV2Filtering interface , > default boolean equalToIgnoreRuntimeFilters(Scan other) { > return this.equals(other); > } > default int hashCodeIgnoreRuntimeFilters() { > return this.hashCode(); > } > In the BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then > instead of batch.equals, it should call scan.equalToIgnoreRuntimeFilters > And the underlying Scan implementations should provide equality which > excludes run time filters. > Similarly the hashCode of BatchScanExec, should use > scan.hashCodeIgnoreRuntimeFilters instead of ( batch.hashCode). > Will be creating a PR with bug test for review. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45925) SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE
[ https://issues.apache.org/jira/browse/SPARK-45925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45925: - Labels: pull-request-available (was: ) > SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec > causing re-use of exchange not happening in AQE > -- > > Key: SPARK-45925 > URL: https://issues.apache.org/jira/browse/SPARK-45925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > A created stage may contain SubqueryAdaptiveBroadcastExec while incominng > exchange may contain SubqueryBroadcastExec and though they are equivalent , > the match does not happen because equals/hashCode do not match , resulting in > non re-use of exchange. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45924: - Labels: pull-request-available (was: ) > Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not > equivalent with SubqueryBroadcastExec > > > Key: SPARK-45924 > URL: https://issues.apache.org/jira/browse/SPARK-45924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > while writing bug test for > [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866], > found that canonicalization of SubqueryAdaptiveBroadcastExec is broken in > the sense that buildPlan : LogicalPlan is not canonicalized which causes > batchscans to differ when reuse of exchange check happens in AQE. > Moreover the equivalence of SubqueryAdaptiveBroadcastExec and > SubqueryBroadcastExec is not there which also aggravates the re-use of > exchange in aqe broken. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45926) The InMemoryV2FilterBatchScan and InMemoryBatchScan are not implementing equals and hashCode correctly
Asif created SPARK-45926: Summary: The InMemoryV2FilterBatchScan and InMemoryBatchScan are not implementing equals and hashCode correctly Key: SPARK-45926 URL: https://issues.apache.org/jira/browse/SPARK-45926 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Asif The InMemoryV2FilterBatchScan and InMemoryBatchScan test classes are not implementing hashCode and equals correctly, as they do not take into account the pushed runtime filters. As a result they are unable to expose, via the TPCDS tests, whether the reuse of exchange is happening correctly or not. If these classes implemented equals and hashCode taking the pushed runtime filters into account, we would see that TPCDS Q14b, which should ideally be reusing the exchange containing the Union, is not doing so due to multiple bugs which surface in AQE. Actual V2 DataSources like iceberg correctly implement equals and hashCode taking pushed runtime filters into account, which also exposes the same issue of exchange reuse not happening -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
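To make the equals/hashCode requirement concrete, here is a toy scan whose equality reflects the runtime filters pushed through SupportsRuntimeV2Filtering. The class name and fields are illustrative (the actual test classes named in the ticket live in Spark's DataSource V2 test sources); the point is only that two scans holding different pushed filters must not compare equal, otherwise exchange-reuse bugs stay hidden in tests.
{code}
import org.apache.spark.sql.connector.expressions.NamedReference
import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.connector.read.{Scan, SupportsRuntimeV2Filtering}
import org.apache.spark.sql.types.StructType

class IllustrativeInMemoryScan(val schema: StructType)
  extends Scan with SupportsRuntimeV2Filtering {

  private var pushedRuntimeFilters: Seq[Predicate] = Seq.empty

  override def readSchema(): StructType = schema
  override def filterAttributes(): Array[NamedReference] = Array.empty
  override def filter(predicates: Array[Predicate]): Unit = {
    pushedRuntimeFilters = predicates.toSeq
  }

  // Equality and hashCode take the pushed runtime filters into account,
  // mirroring what real sources such as iceberg's SparkBatchQueryScan do.
  override def equals(other: Any): Boolean = other match {
    case o: IllustrativeInMemoryScan =>
      schema == o.schema && pushedRuntimeFilters == o.pushedRuntimeFilters
    case _ => false
  }
  override def hashCode(): Int = (schema, pushedRuntimeFilters).hashCode()
}
{code}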
[jira] [Created] (SPARK-45925) SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE
Asif created SPARK-45925: Summary: SubqueryBroadcastExec is not equivalent with SubqueryAdaptiveBroadcastExec causing re-use of exchange not happening in AQE Key: SPARK-45925 URL: https://issues.apache.org/jira/browse/SPARK-45925 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Asif A created stage may contain SubqueryAdaptiveBroadcastExec while the incoming exchange may contain SubqueryBroadcastExec; although the two are equivalent, the match does not happen because equals/hashCode do not match, resulting in the exchange not being reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658 ] Asif deleted comment on SPARK-45658: -- was (Author: ashahid7): I also think that during canonicalization of DynamicPruningSubquery, the pruning key's canonicalization should be done on the basis of the enclosing Plan which contains the DynamicPruningSubquery Expression > Canonicalization of DynamicPruningSubquery is broken > > > Key: SPARK-45658 > URL: https://issues.apache.org/jira/browse/SPARK-45658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > The canonicalization of (buildKeys: Seq[Expression]) in the class > DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by > calling > buildKeys.map(_.canonicalized) > The above would result in incorrect canonicalization as it would not be > normalizing the exprIds relative to buildQuery output > The fix is to use the buildQuery : LogicalPlan's output to normalize the > buildKeys expression > as given below, using the standard approach. > buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), > Will be filing a PR and bug test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-45924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45924: - Description: while writing bug test for [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866], found that canonicalization of SubqueryAdaptiveBroadcastExec is broken in the sense that buildPlan : LogicalPlan is not canonicalized which causes batchscans to differ when reuse of exchange check happens in AQE. Moreover the equivalence of SubqueryAdaptiveBroadcastExec and SubqueryBroadcastExec is not there which also aggravates the re-use of exchange in aqe broken. was: while writing bug test for [SPARK-45866|http://example.com], found that canonicalization of SubqueryAdaptiveBroadcastExec is broken in the sense that buildPlan : LogicalPlan is not canonicalized which causes batchscans to differ when reuse of exchange check happens in AQE. Moreover the equivalence of SubqueryAdaptiveBroadcastExec and SubqueryBroadcastExec is not there which also aggravates the re-use of exchange in aqe broken. > Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not > equivalent with SubqueryBroadcastExec > > > Key: SPARK-45924 > URL: https://issues.apache.org/jira/browse/SPARK-45924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > > while writing bug test for > [SPARK-45866|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-45866], > found that canonicalization of SubqueryAdaptiveBroadcastExec is broken in > the sense that buildPlan : LogicalPlan is not canonicalized which causes > batchscans to differ when reuse of exchange check happens in AQE. > Moreover the equivalence of SubqueryAdaptiveBroadcastExec and > SubqueryBroadcastExec is not there which also aggravates the re-use of > exchange in aqe broken. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45924) Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec
Asif created SPARK-45924: Summary: Canonicalization of SubqueryAdaptiveBroadcastExec is broken and is not equivalent with SubqueryBroadcastExec Key: SPARK-45924 URL: https://issues.apache.org/jira/browse/SPARK-45924 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Asif while writing bug test for [SPARK-45866|http://example.com], found that canonicalization of SubqueryAdaptiveBroadcastExec is broken in the sense that buildPlan : LogicalPlan is not canonicalized which causes batchscans to differ when reuse of exchange check happens in AQE. Moreover the equivalence of SubqueryAdaptiveBroadcastExec and SubqueryBroadcastExec is not there which also aggravates the re-use of exchange in aqe broken. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45373: - Shepherd: (was: Peter Toth) > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1 > > > In the rule PruneFileSourcePartitions where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if : > 1) The translated filter string for push down to HMS layer becomes empty , > resulting in fetching of all partitions and same table is referenced multiple > times in the query. > 2) Or just in case same table is referenced multiple times in the query with > different partition filters. > In such cases current code would result in multiple calls to HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter ( filter1 || filter2) and getting a base > PrunedInmemoryFileIndex which can become a basis for each of the specific > table. > Opened following PR for ticket: > [SPARK-45373-PR|https://github.com/apache/spark/pull/43183] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
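A rough sketch, under assumed names, of the grouping described above: scansInPlan pairs each scan's CatalogFileIndex with its partition filter and is a placeholder for however the rule collects them, while CatalogFileIndex.filterPartitions is the existing API whose invocation reaches the metastore. The idea is one metastore round trip per table, using the least restrictive combined filter (filter1 || filter2 || ...), with the result then specialized per scan.
{code}
import org.apache.spark.sql.catalyst.expressions.{Expression, Or}
import org.apache.spark.sql.execution.datasources.{CatalogFileIndex, InMemoryFileIndex}

// Group scans of the same catalog table and prune partitions once per table.
def pruneOnce(
    scansInPlan: Seq[(CatalogFileIndex, Expression)]): Map[CatalogFileIndex, InMemoryFileIndex] =
  scansInPlan
    .groupBy(_._1)
    .map { case (index, scans) =>
      val combined = scans.map(_._2).reduce[Expression](Or(_, _))
      index -> index.filterPartitions(Seq(combined))
    }
{code}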
[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-33152: - Affects Version/s: 3.5.0 (was: 2.4.0) (was: 3.0.1) (was: 3.1.2) > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases( with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue if not fixed can cause OutOfMemory issue or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by the virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on the working of some of the other unrelated previous optimizer rules is > behaving, is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in existing ConstraintPropagationSuite which is > missing a IsNotNull constraints because the code incorrectly generated a > EqualsNullSafeConstraint instead of EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluation all the > constraints can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. What problem is this proposal NOT designed to solve? 
> It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed/ does not > happen , respectively, by the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code, pessimistically tries to generates all > the possible combinations of constraints , based on the aliases ( even then > it may miss a lot of combinations if the expression is a complex expression > involving same attribute repeated multiple times within the expression and > there are many aliases to that column). There are query plans in our > production env, which can result in intermediate number of constraints going > into hundreds of thousands, causing OOM or taking time running into hours. > Also there are cases where it incorrectly generates an EqualNullSafe > constraint instead of EqualTo constraint , thus missing a possible IsNull > constraint on column. > Also it only pushes single column predicate on the other side of the join. > The constraints generated , in some cases, are missing the required ones, and > the plan
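To make the alias blow-up described in the SPIP text concrete, here is a small, made-up DataFrame in its spirit (column names and counts are purely illustrative): the column "a" is aliased several times and also referenced more than once inside a CASE expression, and with the existing pessimistic expansion every combination of alias substitutions becomes a candidate constraint, which is what grows compilation time so sharply.
{code}
import org.apache.spark.sql.functions.{col, when}

// "a" has three aliases and appears twice inside the CASE expression; the
// current constraint propagation enumerates constraint variants for every
// combination of alias substitutions.
val df = spark.range(10).toDF("a")
  .select(col("a"), col("a").as("a1"), col("a").as("a2"), col("a").as("a3"),
    when(col("a") > 1, col("a") * 2).otherwise(col("a") * 3).as("c"))
  .where(col("c") > 0)
{code}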
[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45373: - Affects Version/s: 3.5.0 (was: 4.0.0) > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1 > > > In the rule PruneFileSourcePartitions where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if : > 1) The translated filter string for push down to HMS layer becomes empty , > resulting in fetching of all partitions and same table is referenced multiple > times in the query. > 2) Or just in case same table is referenced multiple times in the query with > different partition filters. > In such cases current code would result in multiple calls to HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter ( filter1 || filter2) and getting a base > PrunedInmemoryFileIndex which can become a basis for each of the specific > table. > Opened following PR for ticket: > [SPARK-45373-PR|https://github.com/apache/spark/pull/43183] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Affects Version/s: 3.5.0 (was: 3.5.1) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: pull-request-available > Attachments: perf results broadcast var pushdown - Partitioned > TPCDS.pdf > > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. 
Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , the PR on spark side is > [spark-broadcast-var|https://github.com/apache/spark/pull/43373]. For non > partition table TPCDS run on laptop with TPCDS data size of ( scale factor
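A hedged sketch of the "single row filteration" idea above. HashedRelation and InternalRow are real Spark internals, but containsKey is the method the SPIP proposes to add; the sketch approximates it with the existing get(...) call, and joinKeyOf is a placeholder for however the stream-side join key is extracted, so this is only the shape of the idea rather than the actual generated code.
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.joins.HashedRelation

// Drop stream-side rows whose join key is absent from the broadcast build side
// before they reach the join (and any joins above it in a nested join tree).
def pruneStreamSide(
    rows: Iterator[InternalRow],
    buildSide: HashedRelation,
    joinKeyOf: InternalRow => InternalRow): Iterator[InternalRow] =
  rows.filter(row => buildSide.get(joinKeyOf(row)) != null)
{code}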
[jira] [Created] (SPARK-45866) Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg )
Asif created SPARK-45866: Summary: Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg ) Key: SPARK-45866 URL: https://issues.apache.org/jira/browse/SPARK-45866 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Asif In certain types of queries for eg TPCDS Query 14b, the reuse of exchange does not happen in AQE , resulting in perf degradation. The spark TPCDS tests are unable to catch the problem, because the InMemoryScan used for testing do not implement the equals & hashCode correctly , in the sense , that they do take into account the pushed down run time filters. In concrete Scan implementations, for eg iceberg's SparkBatchQueryScan , the equality check , apart from other things, also involves Runtime Filters pushed ( which is correct). In spark the issue is this: For a given stage being materialized, just before materialization starts, the run time filters are confined to the BatchScanExec level. Only when the actual RDD corresponding to the BatchScanExec, is being evaluated, do the runtime filters get pushed to the underlying Scan. Now if a new stage is created and it checks in the stageCache using its canonicalized plan to see if a stage can be reused, it fails to find the r-usable stage even if the stage exists, because the canonicalized spark plan present in the stage cache, has now the run time filters pushed to the Scan , so the incoming canonicalized spark plan does not match the key as their underlying scans differ . that is incoming spark plan's scan does not have runtime filters , while the canonicalized spark plan present as key in the stage cache has the scan with runtime filters pushed. The fix as I have worked is to provide, two methods in the SupportsRuntimeV2Filtering interface , default boolean equalToIgnoreRuntimeFilters(Scan other) { return this.equals(other); } default int hashCodeIgnoreRuntimeFilters() { return this.hashCode(); } In the BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then instead of batch.equals, it should call scan.equalToIgnoreRuntimeFilters And the underlying Scan implementations should provide equality which excludes run time filters. Similarly the hashCode of BatchScanExec, should use scan.hashCodeIgnoreRuntimeFilters instead of ( batch.hashCode). Will be creating a PR with bug test for review. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784567#comment-17784567 ] Asif commented on SPARK-45658: -- I also think that during canonicalization of DynamicPruningSubquery, the pruning key's canonicalization should be done on the basis of the enclosing Plan which contains the DynamicPruningSubquery Expression > Canonicalization of DynamicPruningSubquery is broken > > > Key: SPARK-45658 > URL: https://issues.apache.org/jira/browse/SPARK-45658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > The canonicalization of (buildKeys: Seq[Expression]) in the class > DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by > calling > buildKeys.map(_.canonicalized) > The above would result in incorrect canonicalization as it would not be > normalizing the exprIds relative to buildQuery output > The fix is to use the buildQuery : LogicalPlan's output to normalize the > buildKeys expression > as given below, using the standard approach. > buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), > Will be filing a PR and bug test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
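A short sketch of the fix quoted in the description above: instead of calling .canonicalized on each build key directly, normalize them against buildQuery.output so that exprIds become position-based and two semantically equal subqueries compare equal. QueryPlan.normalizeExpressions is the existing helper; buildKeys and buildQuery are assumed to be the DynamicPruningSubquery fields in scope.
{code}
import org.apache.spark.sql.catalyst.plans.QueryPlan

// Position-based normalization relative to the build query's output, the same
// approach other plan nodes use when canonicalizing their expressions.
val canonicalBuildKeys =
  buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
{code}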
[jira] [Commented] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784282#comment-17784282 ] Asif commented on SPARK-44662: -- The changes for iceberg which support broadcast-var-pushdown are present in the git repo: [iceberg-repo|https://github.com/ahshahid/iceberg.git] branch : broadcastvar-push. The changes done in the iceberg branch are compatible with latest apache/spark master ( identified as 3.5 to iceberg) and tested and compiled using scala 2.13. To get the iceberg-spark-run-time jar for use: First locally install the spark jars using the PR of spark mentioned below. (./build/mvn clean install -Phive -Phive-thriftserver -DskipTests) Then use the iceberg branch broadcastvar-push to create the iceberg spark runtime jar such that it uses the locally installed spark as dependency. In case you are interested in evaluating performance, pls let me know. > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > Attachments: perf results broadcast var pushdown - Partitioned > TPCDS.pdf > > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). 
> The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target >
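As an aside on the "push the broadcast keys as a SortedSet" idea described in this update, here is a small, self-contained sketch. PushedBroadcastKeys and fromBuildSide are illustrative names, not Spark or Iceberg APIs; the point is only that an ordered key set supports cheap min/max overlap checks against file- or manifest-level column statistics.

{code:scala}
import scala.collection.immutable.SortedSet

// Illustrative container for the build-side join key values, collected once the
// broadcast relation is materialized and kept ordered for range checks.
final case class PushedBroadcastKeys[T](keys: SortedSet[T]) {
  require(keys.nonEmpty, "no point pushing an empty key set")

  def minKey: T = keys.head
  def maxKey: T = keys.last

  // Can any pushed key fall inside a file's / manifest's [lower, upper] stats range?
  // If not, the whole file (or row group) can be skipped.
  def overlaps(lower: T, upper: T): Boolean = {
    val it = keys.iteratorFrom(lower)          // first key >= lower
    it.hasNext && keys.ordering.lteq(it.next(), upper)
  }
}

object PushedBroadcastKeys {
  // Sketch: the keys are already available on the driver once the broadcast side
  // has been built, so collecting them adds no extra query the way DPP does.
  def fromBuildSide(keyValues: Iterator[Long]): PushedBroadcastKeys[Long] =
    PushedBroadcastKeys(SortedSet.empty[Long] ++ keyValues)
}
{code}

A data source that keeps per-file min/max column statistics could then drop a file whenever overlaps(fileMin, fileMax) is false, at the driver for manifests and at the executor for finer-grained units such as row groups.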
[jira] [Commented] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782927#comment-17782927 ] Asif commented on SPARK-44662: -- The majority of file changes are due to additional tpcds tests for iceberg. These will not be included as such in final PR > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > Attachments: perf results broadcast var pushdown - Partitioned > TPCDS.pdf > > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. 
In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , the PR on spark side is >
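Since the description above leans on a new "containsKey" method on HashedRelation applied in the code generated at the ColumnarToRow boundary, here is a rough, hedged sketch of what that per-row check amounts to. BroadcastKeyMembership stands in for the proposed HashedRelation addition, and filterRows is illustrative, not the actual generated code.

{code:scala}
import scala.jdk.CollectionConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Stand-in for the proposed HashedRelation.containsKey(...) described above.
trait BroadcastKeyMembership {
  def containsKey(key: InternalRow): Boolean
}

object SingleRowFiltrationSketch {
  // Conceptually what the generated code at the columnar-to-row boundary would do:
  // keep a row only if every pushed broadcast relation contains its join key, so
  // rows that cannot survive any of the nested broadcast joins are dropped early.
  def filterRows(
      batch: ColumnarBatch,
      keyExtractors: Seq[InternalRow => InternalRow],  // one extractor per pushed join
      relations: Seq[BroadcastKeyMembership]): Iterator[InternalRow] = {
    batch.rowIterator().asScala.filter { row =>
      keyExtractors.zip(relations).forall { case (extractKey, rel) =>
        rel.containsKey(extractKey(row))
      }
    }
  }
}
{code}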
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Attachment: perf results broadcast var pushdown - Partitioned TPCDS.pdf > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > Attachments: perf results broadcast var pushdown - Partitioned > TPCDS.pdf > > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. 
Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , the PR on spark side is > [spark-broadcast-var|https://github.com/apache/spark/pull/43373]. For non > partition table TPCDS run on laptop with TPCDS data size of ( scale
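The description also notes that under AQE the stream-side stage evaluation has to be delayed until all join filters in the nested tree have been pushed to their target scans. As a purely conceptual illustration of that ordering constraint (this is not AQE code, and StreamSideGate is an invented name), a small latch-based sketch:

{code:scala}
import java.util.concurrent.CountDownLatch

// Toy expression of the ordering constraint: the stream-side scan must not start
// until every broadcast join targeting it has pushed its key filter.
final class StreamSideGate(expectedJoinFilters: Int) {
  private val pending = new CountDownLatch(expectedJoinFilters)

  // Called once per broadcast hash join after its keys have been pushed to the scan.
  def filterPushed(): Unit = pending.countDown()

  // Called by the stream-side stage; blocks until all expected filters are in place.
  def awaitAndStart(startScan: () => Unit): Unit = {
    pending.await()
    startScan()
  }
}
{code}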
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Description: h2. *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.* On the lines of DPP which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended over to the case where joining key is not a partition column. The Keys of BroadcastHashJoin are already available before actual evaluation of the stream iterator. These keys can be pushed down to the DataSource as a SortedSet. For non partition columns, the DataSources like iceberg have max/min stats on column available at manifest level, and for formats like parquet , they have max/min stats at various storage level. The passed SortedSet can be used to prune using ranges at both driver level ( manifests files) as well as executor level ( while actually going through chunks , row groups etc at parquet level) If the data is stored as Columnar Batch format , then it would not be possible to filter out individual row at DataSource level, even though we have keys. But at the scan level, ( ColumnToRowExec) it is still possible to filter out as many rows as possible , if the query involves nested joins. Thus reducing the number of rows to join at the higher join levels. Will be adding more details.. h2. *Q2. What problem is this proposal NOT designed to solve?* This can only help in BroadcastHashJoin's performance if the join is Inner or Left Semi. This will also not work if there are nodes like Expand, Generator , Aggregate (without group by on keys not part of joining column etc) below the BroadcastHashJoin node being targeted. h2. *Q3. How is it done today, and what are the limits of current practice?* Currently this sort of pruning at DataSource level is being done using DPP (Dynamic Partition Pruning ) and IFF one of the join key column is a Partitioning column ( so that cost of DPP query is justified and way less than amount of data it will be filtering by skipping partitions). The limitation is that DPP type approach is not implemented ( intentionally I believe), if the join column is a non partition column ( because of cost of "DPP type" query would most likely be way high as compared to any possible pruning ( especially if the column is not stored in a sorted manner). h2. *Q4. What is new in your approach and why do you think it will be successful?* 1) This allows pruning on non partition column based joins. 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP type" query. 3) The Data can be used by DataSource to prune at driver (possibly) and also at executor level ( as in case of parquet which has max/min at various structure levels) 4) The big benefit should be seen in multilevel nested join queries. In the current code base, if I am correct, only one join's pruning filter would get pushed at scan level. Since it is on partition key may be that is sufficient. But if it is a nested Join query , and may be involving different columns on streaming side for join, each such filter push could do significant pruning. This requires some handling in case of AQE, as the stream side iterator ( & hence stage evaluation needs to be delayed, till all the available join filters in the nested tree are pushed at their respective target BatchScanExec). h4. 
*Single Row Filteration* 5) In case of nested broadcasted joins, if the datasource is column vector oriented , then what spark would get is a ColumnarBatch. But because scans have Filters from multiple joins, they can be retrieved and can be applied in code generated at ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastedHashJoins ( whose keys have been pushed) , will be used for join evaluation. The code is already there , the PR on spark side is [spark-broadcast-var|https://github.com/apache/spark/pull/43373]. For non partition table TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing 15% gain. For partition table TPCDS, there is improvement in 4 - 5 queries to the tune of 10% to 37%. h2. *Q5. Who cares? If you are successful, what difference will it make?* If use cases involve multiple joins especially when the join columns are non partitioned, and performance is a criteria, this PR *might* help. h2. *Q6. What are the risks?* Well the changes are extensive. review will be painful . Though code is being tested continuously and adding more tests , with big change, some possibility of bugs is there. But as of now, I think the code is robust. To get the Perf benefit fully, the pushed filters utilization needs to be implemented on the DataSource side too. Have already done it for {*}iceberg{*}. But I believe atleast in case of Nested Broadcast Hash Joins,
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Affects Version/s: 3.5.1 (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. 
> But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , the PR on spark side is > [spark-broadcast-var|https://github.com/apache/spark/pull/43373]. For non > partition table TPCDS run on laptop with TPCDS data size of ( scale factor > 4), I am seeing 15% gain. > For partition table TPCDS, there is improvement in 4 - 5
[jira] [Commented] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes
[ https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781976#comment-17781976 ] Asif commented on SPARK-36786: -- I had put this on the back burner as my changes were on 3.2, so I have to merge them onto the latest master. Whatever optimizations I did on 3.2 are still applicable, as the drawback still exists, but the changes are going to be a little extensive. If there is interest in it I can pick it up after some days; right now I am occupied with another SPIP which proposes changes for improving the performance of broadcast hash joins on non partition column joins. > SPIP: Improving the compile time performance, by improving a couple of > rules, from 24 hrs to under 8 minutes > - > > Key: SPARK-36786 > URL: https://issues.apache.org/jira/browse/SPARK-36786 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1, 3.1.2 >Reporter: Asif >Priority: Major > Labels: SPIP > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > The aim is to improve the compile time performance of a query which in > WorkDay's use case takes > 24 hrs ( & eventually fails) , to < 8 min. > To explain the problem, I will provide the context. > The query plan in our production system is huge, with nested *case when* > expressions ( level of nesting could be > 8) , where each *case when* can > have branches sometimes > 1000. > The plan could look like > {quote}Project1 > | > Filter 1 > | > Project2 > | > Filter2 > | > Project3 > | > Filter3 > | > Join > {quote} > Now the optimizer has a Batch of Rules , intended to run at max 100 times. > *Also note that the batch will continue to run till one of the conditions > is satisfied* > *i.e either numIter == 100 || inputPlan == outputPlan (idempotency is > achieved)* > One of the early Rules is *PushDownPredicateRule*, > followed by *CollapseProject*. > > The first issue is the *PushDownPredicate* rule. > It picks one filter at a time & pushes it to the lowest level ( I understand > that in 3.1 it pushes through join, while in 2.4 it stops at Join) , but > in either case it picks 1 filter at a time starting from top, in each iteration. > *The above comment is no longer true in the 3.1 release as it now combines > filters, so it does push all the encountered filters in a single pass. > But it still materializes the filter on each push by re-aliasing.* > So if there are say 50 projects interspersed with Filters , idempotency > is guaranteed not to be achieved until around 49 iterations. > Moreover, CollapseProject will also be modifying the tree on each iteration as a > filter will get removed within a Project. > Moreover, on each movement of a filter through the project tree, the filter is > re-aliased using transformUp. transformUp is very expensive compared to > transformDown. As the filter keeps getting pushed down , its size increases. > To optimize this rule , 2 things are needed: > # Instead of pushing one filter at a time, collect all the filters as we > traverse the tree in that iteration itself. > # Do not re-alias the filters on each push. Collect the sequence of projects > each filter has passed through, and when the filters have reached their resting > place, do the re-alias by processing the collected projects in down to up > manner. > This will result in achieving idempotency in a couple of iterations. > *How reducing the number of iterations helps in performance* > There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals > ( ...
there are around 6 more such rules)* which traverse the tree using > transformUp, and they run unnecessarily in each iteration , even when the > expressions in an operator have not changed since the previous runs. > *I have a different proposal which I will share later, as to how to avoid the > above rules from running unnecessarily, if it can be guaranteed that the > expression is not going to mutate in the operator.* > The cause of our huge compilation time has been identified as the above. > > h2. Q2. What problem is this proposal NOT designed to solve? > It is not going to change any runtime profile. > h2. Q3. How is it done today, and what are the limits of current practice? > Like mentioned above , currently PushDownPredicate pushes one filter at a > time & at each Project , it materialized the re-aliased filter. This > results in large number of iterations to achieve idempotency as well as > immediate materialization of Filter after each Project pass,, results in > unnecessary tree traversals of filter expression that too using transformUp. > and the expression tree of filter is bound to keep
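The two optimizations described in the quoted description (collect every filter in one downward pass, and defer the re-aliasing until the filters have reached their final position) can be sketched roughly as below. This is an illustrative simplification over a plain Project/Filter chain using catalyst classes, not the actual PushDownPredicates rule, and it ignores the determinism and pushdown-safety checks the real rule must make.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Alias, And, Attribute, ExprId, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

object OnePassFilterPushdownSketch {

  def pushFiltersInOnePass(plan: LogicalPlan): LogicalPlan = {
    // alias exprId -> defining expression, for one Project
    def aliasMap(p: Project): Map[ExprId, Expression] =
      p.projectList.collect { case a: Alias => a.exprId -> a.child }.toMap

    def substitute(e: Expression, m: Map[ExprId, Expression]): Expression =
      e.transformUp { case a: Attribute if m.contains(a.exprId) => m(a.exprId) }

    // A filter condition plus the number of Projects already crossed above it
    // (those do not need to be substituted into it).
    final case class Pending(cond: Expression, projectsAbove: Int)

    def walk(node: LogicalPlan, pending: Seq[Pending], projects: Seq[Project]): LogicalPlan =
      node match {
        case f: Filter  => walk(f.child, pending :+ Pending(f.condition, projects.size), projects)
        case p: Project => walk(p.child, pending, projects :+ p)
        case leaf =>
          // Re-alias each collected condition exactly once, applying the alias maps of
          // the Projects it is being pushed below, in top-down order.
          val maps = projects.map(aliasMap)
          val rewritten = pending.map { case Pending(c, seen) =>
            maps.drop(seen).foldLeft(c)(substitute)
          }
          val filtered: LogicalPlan =
            rewritten.reduceOption(And).map(Filter(_, leaf)).getOrElse(leaf)
          projects.foldRight(filtered)((p, child) => p.copy(child = child))
      }

    walk(plan, Nil, Nil)
  }
}
{code}

With this shape, a single optimizer pass leaves the filters at their resting place below the Projects, so the batch reaches a fixed point in far fewer iterations than pushing and re-aliasing one filter at a time.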
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45658: - Description: The canonicalization of (buildKeys: Seq[Expression]) in the class DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by calling buildKeys.map(_.canonicalized) The above would result in incorrect canonicalization as it would not be normalizing the exprIds relative to buildQuery output The fix is to use the buildQuery : LogicalPlan's output to normalize the buildKeys expression as given below, using the standard approach. buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), Will be filing a PR and bug test for the same. was: The canonicalization of (buildKeys: Seq[Expression]) in the class DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by calling buildKeys.map(_.canonicalized) The above would result in incorrect canonicalization as it would not be normalizing the exprIds The fix is to use the buildQuery : LogicalPlan's output to normalize the buildKeys expression as given below, using the standard approach. buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), Will be filing a PR and bug test for the same. > Canonicalization of DynamicPruningSubquery is broken > > > Key: SPARK-45658 > URL: https://issues.apache.org/jira/browse/SPARK-45658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 3.5.1 >Reporter: Asif >Priority: Major > > The canonicalization of (buildKeys: Seq[Expression]) in the class > DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by > calling > buildKeys.map(_.canonicalized) > The above would result in incorrect canonicalization as it would not be > normalizing the exprIds relative to buildQuery output > The fix is to use the buildQuery : LogicalPlan's output to normalize the > buildKeys expression > as given below, using the standard approach. > buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), > Will be filing a PR and bug test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45658: - Priority: Major (was: Critical) > Canonicalization of DynamicPruningSubquery is broken > > > Key: SPARK-45658 > URL: https://issues.apache.org/jira/browse/SPARK-45658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 3.5.1 >Reporter: Asif >Priority: Major > > The canonicalization of (buildKeys: Seq[Expression]) in the class > DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by > calling > buildKeys.map(_.canonicalized) > The above would result in incorrect canonicalization as it would not be > normalizing the exprIds > The fix is to use the buildQuery : LogicalPlan's output to normalize the > buildKeys expression > as given below, using the standard approach. > buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), > Will be filing a PR and bug test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
Asif created SPARK-45658: Summary: Canonicalization of DynamicPruningSubquery is broken Key: SPARK-45658 URL: https://issues.apache.org/jira/browse/SPARK-45658 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.5.1 Reporter: Asif The canonicalization of (buildKeys: Seq[Expression]) in the class DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by calling buildKeys.map(_.canonicalized) The above would result in incorrect canonicalization as it would not be normalizing the exprIds The fix is to use the buildQuery : LogicalPlan's output to normalize the buildKeys expression as given below, using the standard approach. buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), Will be filing a PR and bug test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45373: - Affects Version/s: 4.0.0 (was: 3.5.1) > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1 > > > In the rule PruneFileSourcePartitions where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if : > 1) The translated filter string for push down to HMS layer becomes empty , > resulting in fetching of all partitions and same table is referenced multiple > times in the query. > 2) Or just in case same table is referenced multiple times in the query with > different partition filters. > In such cases current code would result in multiple calls to HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter ( filter1 || filter2) and getting a base > PrunedInmemoryFileIndex which can become a basis for each of the specific > table. > Opened following PR for ticket: > [SPARK-45373-PR|https://github.com/apache/spark/pull/43183] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
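To illustrate the grouping idea in the description above, a hedged sketch follows. It assumes the existing CatalogFileIndex.filterPartitions entry point is where the HiveMetaStore listing ultimately happens; sharedBaseIndexes is an invented helper name, and the actual PR does the grouping and pruning inside the rule rather than through a free-standing function.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{And, Expression, Or}
import org.apache.spark.sql.execution.datasources.{CatalogFileIndex, InMemoryFileIndex}

object GroupedPartitionPruningSketch {
  // One HMS-backed listing per distinct CatalogFileIndex instead of one per table
  // reference: list with the OR of all per-reference partition filters, then let each
  // reference prune further, in memory, from the shared base index.
  def sharedBaseIndexes(
      references: Seq[(CatalogFileIndex, Seq[Expression])])
    : Map[CatalogFileIndex, InMemoryFileIndex] = {
    references.groupBy(_._1).map { case (catalogIndex, refs) =>
      val perRefFilters = refs.map(_._2)
      // If any reference has no usable partition filter, the common filter must be
      // empty too (i.e. fetch all partitions, but only once); otherwise OR them.
      val commonFilter: Seq[Expression] =
        if (perRefFilters.exists(_.isEmpty)) Nil
        else Seq(perRefFilters.map(_.reduce(And)).reduce(Or))
      catalogIndex -> catalogIndex.filterPartitions(commonFilter)
    }
  }
}
{code}

Grouping by the CatalogFileIndex instance is a simplification; the point is simply that a single listing with (filter1 || filter2) subsumes what both table references need, after which each reference can be narrowed without another metastore round trip.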
[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45373: - Description: In the rule PruneFileSourcePartitions where the CatalogFileIndex gets converted to InMemoryFileIndex, the HMS calls can get very expensive if : 1) The translated filter string for push down to HMS layer becomes empty , resulting in fetching of all partitions and same table is referenced multiple times in the query. 2) Or just in case same table is referenced multiple times in the query with different partition filters. In such cases current code would result in multiple calls to HMS layer. This can be avoided by grouping the tables based on CatalogFileIndex and passing a common minimum filter ( filter1 || filter2) and getting a base PrunedInmemoryFileIndex which can become a basis for each of the specific table. Opened following PR for ticket: [SPARK-45373-PR|https://github.com/apache/spark/pull/43183] was: In the rule PruneFileSourcePartitions where the CatalogFileIndex gets converted to InMemoryFileIndex, the HMS calls can get very expensive if : 1) The translated filter string for push down to HMS layer becomes empty , resulting in fetching of all partitions and same table is referenced multiple times in the query. 2) Or just in case same table is referenced multiple times in the query with different partition filters. In such cases current code would result in multiple calls to HMS layer. This can be avoided by grouping the tables based on CatalogFileIndex and passing a common minimum filter ( filter1 || filter2) and getting a base PrunedInmemoryFileIndex which can become a basis for each of the specific table. > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1 > > > In the rule PruneFileSourcePartitions where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if : > 1) The translated filter string for push down to HMS layer becomes empty , > resulting in fetching of all partitions and same table is referenced multiple > times in the query. > 2) Or just in case same table is referenced multiple times in the query with > different partition filters. > In such cases current code would result in multiple calls to HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter ( filter1 || filter2) and getting a base > PrunedInmemoryFileIndex which can become a basis for each of the specific > table. > Opened following PR for ticket: > [SPARK-45373-PR|https://github.com/apache/spark/pull/43183] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373 ] Asif deleted comment on SPARK-45373: -- was (Author: ashahid7): Will be generating a PR for this. > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1 > > > In the rule PruneFileSourcePartitions where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if : > 1) The translated filter string for push down to HMS layer becomes empty , > resulting in fetching of all partitions and same table is referenced multiple > times in the query. > 2) Or just in case same table is referenced multiple times in the query with > different partition filters. > In such cases current code would result in multiple calls to HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter ( filter1 || filter2) and getting a base > PrunedInmemoryFileIndex which can become a basis for each of the specific > table. > Opened following PR for ticket: > [SPARK-45373-PR|https://github.com/apache/spark/pull/43183] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770220#comment-17770220 ] Asif commented on SPARK-45373: -- Will be generating a PR for this. > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Minor > Fix For: 3.5.1 > > > In the rule PruneFileSourcePartitions where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if : > 1) The translated filter string for push down to HMS layer becomes empty , > resulting in fetching of all partitions and same table is referenced multiple > times in the query. > 2) Or just in case same table is referenced multiple times in the query with > different partition filters. > In such cases current code would result in multiple calls to HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter ( filter1 || filter2) and getting a base > PrunedInmemoryFileIndex which can become a basis for each of the specific > table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
Asif created SPARK-45373: Summary: Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated Key: SPARK-45373 URL: https://issues.apache.org/jira/browse/SPARK-45373 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Asif Fix For: 3.5.1 In the rule PruneFileSourcePartitions where the CatalogFileIndex gets converted to InMemoryFileIndex, the HMS calls can get very expensive if : 1) The translated filter string for push down to HMS layer becomes empty , resulting in fetching of all partitions and same table is referenced multiple times in the query. 2) Or just in case same table is referenced multiple times in the query with different partition filters. In such cases current code would result in multiple calls to HMS layer. This can be avoided by grouping the tables based on CatalogFileIndex and passing a common minimum filter ( filter1 || filter2) and getting a base PrunedInmemoryFileIndex which can become a basis for each of the specific table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Description: h2. *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.* On the lines of DPP which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended over to the case where joining key is not a partition column. The Keys of BroadcastHashJoin are already available before actual evaluation of the stream iterator. These keys can be pushed down to the DataSource as a SortedSet. For non partition columns, the DataSources like iceberg have max/min stats on column available at manifest level, and for formats like parquet , they have max/min stats at various storage level. The passed SortedSet can be used to prune using ranges at both driver level ( manifests files) as well as executor level ( while actually going through chunks , row groups etc at parquet level) If the data is stored as Columnar Batch format , then it would not be possible to filter out individual row at DataSource level, even though we have keys. But at the scan level, ( ColumnToRowExec) it is still possible to filter out as many rows as possible , if the query involves nested joins. Thus reducing the number of rows to join at the higher join levels. Will be adding more details.. h2. *Q2. What problem is this proposal NOT designed to solve?* This can only help in BroadcastHashJoin's performance if the join is Inner or Left Semi. This will also not work if there are nodes like Expand, Generator , Aggregate (without group by on keys not part of joining column etc) below the BroadcastHashJoin node being targeted. h2. *Q3. How is it done today, and what are the limits of current practice?* Currently this sort of pruning at DataSource level is being done using DPP (Dynamic Partition Pruning ) and IFF one of the join key column is a Partitioning column ( so that cost of DPP query is justified and way less than amount of data it will be filtering by skipping partitions). The limitation is that DPP type approach is not implemented ( intentionally I believe), if the join column is a non partition column ( because of cost of "DPP type" query would most likely be way high as compared to any possible pruning ( especially if the column is not stored in a sorted manner). h2. *Q4. What is new in your approach and why do you think it will be successful?* 1) This allows pruning on non partition column based joins. 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP type" query. 3) The Data can be used by DataSource to prune at driver (possibly) and also at executor level ( as in case of parquet which has max/min at various structure levels) 4) The big benefit should be seen in multilevel nested join queries. In the current code base, if I am correct, only one join's pruning filter would get pushed at scan level. Since it is on partition key may be that is sufficient. But if it is a nested Join query , and may be involving different columns on streaming side for join, each such filter push could do significant pruning. This requires some handling in case of AQE, as the stream side iterator ( & hence stage evaluation needs to be delayed, till all the available join filters in the nested tree are pushed at their respective target BatchScanExec). h4. 
*Single Row Filteration* 5) In case of nested broadcasted joins, if the datasource is column vector oriented , then what spark would get is a ColumnarBatch. But because scans have Filters from multiple joins, they can be retrieved and can be applied in code generated at ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastedHashJoins ( whose keys have been pushed) , will be used for join evaluation. The code is already there , will be opening a PR. For non partition table TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing 15% gain. For partition table TPCDS, there is improvement in 4 - 5 queries to the tune of 10% to 37%. h2. *Q5. Who cares? If you are successful, what difference will it make?* If use cases involve multiple joins especially when the join columns are non partitioned, and performance is a criteria, this PR *might* help. h2. *Q6. What are the risks?* Well the changes are extensive. review will be painful . Though code is being tested continuously and adding more tests , with big change, some possibility of bugs is there. But as of now, I think the code is robust. To get the Perf benefit fully, the pushed filters utilization needs to be implemented on the DataSource side too. Have already done it for {*}iceberg{*}. But I believe atleast in case of Nested Broadcast Hash Joins, [#singleRowFilter] approach would still result in perf benefit, even with Default
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Description: h2. *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.* On the lines of DPP which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended over to the case where joining key is not a partition column. The Keys of BroadcastHashJoin are already available before actual evaluation of the stream iterator. These keys can be pushed down to the DataSource as a SortedSet. For non partition columns, the DataSources like iceberg have max/min stats on column available at manifest level, and for formats like parquet , they have max/min stats at various storage level. The passed SortedSet can be used to prune using ranges at both driver level ( manifests files) as well as executor level ( while actually going through chunks , row groups etc at parquet level) If the data is stored as Columnar Batch format , then it would not be possible to filter out individual row at DataSource level, even though we have keys. But at the scan level, ( ColumnToRowExec) it is still possible to filter out as many rows as possible , if the query involves nested joins. Thus reducing the number of rows to join at the higher join levels. Will be adding more details.. h2. *Q2. What problem is this proposal NOT designed to solve?* This can only help in BroadcastHashJoin's performance if the join is Inner or Left Semi. This will also not work if there are nodes like Expand, Generator , Aggregate (without group by on keys not part of joining column etc) below the BroadcastHashJoin node being targeted. h2. *Q3. How is it done today, and what are the limits of current practice?* Currently this sort of pruning at DataSource level is being done using DPP (Dynamic Partition Pruning ) and IFF one of the join key column is a Partitioning column ( so that cost of DPP query is justified and way less than amount of data it will be filtering by skipping partitions). The limitation is that DPP type approach is not implemented ( intentionally I believe), if the join column is a non partition column ( because of cost of "DPP type" query would most likely be way high as compared to any possible pruning ( especially if the column is not stored in a sorted manner). h2. *Q4. What is new in your approach and why do you think it will be successful?* 1) This allows pruning on non partition column based joins. 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP type" query. 3) The Data can be used by DataSource to prune at driver (possibly) and also at executor level ( as in case of parquet which has max/min at various structure levels) 4) The big benefit should be seen in multilevel nested join queries. In the current code base, if I am correct, only one join's pruning filter would get pushed at scan level. Since it is on partition key may be that is sufficient. But if it is a nested Join query , and may be involving different columns on streaming side for join, each such filter push could do significant pruning. This requires some handling in case of AQE, as the stream side iterator ( & hence stage evaluation needs to be delayed, till all the available join filters in the nested tree are pushed at their respective target BatchScanExec). 
{anchor:singleRowfilter}5) In case of nested broadcasted joins, if the datasource is column vector oriented , then what spark would get is a ColumnarBatch. But because scans have Filters from multiple joins, they can be retrieved and can be applied in code generated at ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastedHashJoins ( whose keys have been pushed) , will be used for join evaluation. The code is already there , will be opening a PR. For non partition table TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing 15% gain. For partition table TPCDS, there is improvement in 4 - 5 queries to the tune of 10% to 37%. h2. *Q5. Who cares? If you are successful, what difference will it make?* If use cases involve multiple joins especially when the join columns are non partitioned, and performance is a criteria, this PR *might* help. h2. *Q6. What are the risks?* Well the changes are extensive. review will be painful ( if it happens). Though code is being tested continuously and adding more tests , with big change, some possibility of bugs is there. But as of now, I think the code is robust. To get the Perf benefit fully, the pushed filters utilization needs to be implemented on the DataSource side too. Have already done it for *iceberg*. But I believe atleast in case of Nested Broadcast Hash Joins, [#singleRowfilter] approach would still result in perf benefit. h2. *Q7.
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Description: h2. *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.* On the lines of DPP which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended over to the case where joining key is not a partition column. The Keys of BroadcastHashJoin are already available before actual evaluation of the stream iterator. These keys can be pushed down to the DataSource as a SortedSet. For non partition columns, the DataSources like iceberg have max/min stats on column available at manifest level, and for formats like parquet , they have max/min stats at various storage level. The passed SortedSet can be used to prune using ranges at both driver level ( manifests files) as well as executor level ( while actually going through chunks , row groups etc at parquet level) If the data is stored as Columnar Batch format , then it would not be possible to filter out individual row at DataSource level, even though we have keys. But at the scan level, ( ColumnToRowExec) it is still possible to filter out as many rows as possible , if the query involves nested joins. Thus reducing the number of rows to join at the higher join levels. Will be adding more details.. h2. *Q2. What problem is this proposal NOT designed to solve?* This can only help in BroadcastHashJoin's performance if the join is Inner or Left Semi. This will also not work if there are nodes like Expand, Generator , Aggregate (without group by on keys not part of joining column etc) below the BroadcastHashJoin node being targeted. h2. *Q3. How is it done today, and what are the limits of current practice?* Currently this sort of pruning at DataSource level is being done using DPP (Dynamic Partition Pruning ) and IFF one of the join key column is a Partitioning column ( so that cost of DPP query is justified and way less than amount of data it will be filtering by skipping partitions). The limitation is that DPP type approach is not implemented ( intentionally I believe), if the join column is a non partition column ( because of cost of "DPP type" query would most likely be way high as compared to any possible pruning ( especially if the column is not stored in a sorted manner). h2. *Q4. What is new in your approach and why do you think it will be successful?* 1) This allows pruning on non partition column based joins. 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP type" query. 3) The Data can be used by DataSource to prune at driver (possibly) and also at executor level ( as in case of parquet which has max/min at various structure levels) 4) The big benefit should be seen in multilevel nested join queries. In the current code base, if I am correct, only one join's pruning filter would get pushed at scan level. Since it is on partition key may be that is sufficient. But if it is a nested Join query , and may be involving different columns on streaming side for join, each such filter push could do significant pruning. This requires some handling in case of AQE, as the stream side iterator ( & hence stage evaluation needs to be delayed, till all the available join filters in the nested tree are pushed at their respective target BatchScanExec). 
{anchor: singleRowFilter}5) In case of nested broadcasted joins, if the datasource is column vector oriented , then what spark would get is a ColumnarBatch. But because scans have Filters from multiple joins, they can be retrieved and can be applied in code generated at ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastedHashJoins ( whose keys have been pushed) , will be used for join evaluation. The code is already there , will be opening a PR. For non partition table TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing 15% gain. For partition table TPCDS, there is improvement in 4 - 5 queries to the tune of 10% to 37%. h2. *Q5. Who cares? If you are successful, what difference will it make?* If use cases involve multiple joins especially when the join columns are non partitioned, and performance is a criteria, this PR *might* help. h2. *Q6. What are the risks?* Well the changes are extensive. review will be painful ( if it happens). Though code is being tested continuously and adding more tests , with big change, some possibility of bugs is there. But as of now, I think the code is robust. To get the Perf benefit fully, the pushed filters utilization needs to be implemented on the DataSource side too. Have already done it for *iceberg*. But I believe atleast in case of Nested Broadcast Hash Joins, [#singleRowFilter] approach would still result in perf benefit. h2. *Q7.
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Description: h2. *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.* On the lines of DPP which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended over to the case where joining key is not a partition column. The Keys of BroadcastHashJoin are already available before actual evaluation of the stream iterator. These keys can be pushed down to the DataSource as a SortedSet. For non partition columns, the DataSources like iceberg have max/min stats on column available at manifest level, and for formats like parquet , they have max/min stats at various storage level. The passed SortedSet can be used to prune using ranges at both driver level ( manifests files) as well as executor level ( while actually going through chunks , row groups etc at parquet level) If the data is stored as Columnar Batch format , then it would not be possible to filter out individual row at DataSource level, even though we have keys. But at the scan level, ( ColumnToRowExec) it is still possible to filter out as many rows as possible , if the query involves nested joins. Thus reducing the number of rows to join at the higher join levels. Will be adding more details.. h2. *Q2. What problem is this proposal NOT designed to solve?* This can only help in BroadcastHashJoin's performance if the join is Inner or Left Semi. This will also not work if there are nodes like Expand, Generator , Aggregate (without group by on keys not part of joining column etc) below the BroadcastHashJoin node being targeted. h2. *Q3. How is it done today, and what are the limits of current practice?* Currently this sort of pruning at DataSource level is being done using DPP (Dynamic Partition Pruning ) and IFF one of the join key column is a Partitioning column ( so that cost of DPP query is justified and way less than amount of data it will be filtering by skipping partitions). The limitation is that DPP type approach is not implemented ( intentionally I believe), if the join column is a non partition column ( because of cost of "DPP type" query would most likely be way high as compared to any possible pruning ( especially if the column is not stored in a sorted manner). h2. *Q4. What is new in your approach and why do you think it will be successful?* 1) This allows pruning on non partition column based joins. 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP type" query. 3) The Data can be used by DataSource to prune at driver (possibly) and also at executor level ( as in case of parquet which has max/min at various structure levels) 4) The big benefit should be seen in multilevel nested join queries. In the current code base, if I am correct, only one join's pruning filter would get pushed at scan level. Since it is on partition key may be that is sufficient. But if it is a nested Join query , and may be involving different columns on streaming side for join, each such filter push could do significant pruning. This requires some handling in case of AQE, as the stream side iterator ( & hence stage evaluation needs to be delayed, till all the available join filters in the nested tree are pushed at their respective target BatchScanExec). 
{anchor:singleRowFilter}5) In case of nested broadcasted joins, if the datasource is column vector oriented , then what spark would get is a ColumnarBatch. But because scans have Filters from multiple joins, they can be retrieved and can be applied in code generated at ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastedHashJoins ( whose keys have been pushed) , will be used for join evaluation. The code is already there , will be opening a PR. For non partition table TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing 15% gain. For partition table TPCDS, there is improvement in 4 - 5 queries to the tune of 10% to 37%. h2. *Q5. Who cares? If you are successful, what difference will it make?* If use cases involve multiple joins especially when the join columns are non partitioned, and performance is a criteria, this PR *might* help. h2. *Q6. What are the risks?* Well the changes are extensive. review will be painful ( if it happens). Though code is being tested continuously and adding more tests , with big change, some possibility of bugs is there. But as of now, I think the code is robust. To get the Perf benefit fully, the pushed filters utilization needs to be implemented on the DataSource side too. Have already done it for *iceberg*. But I believe atleast in case of Nested Broadcast Hash Joins, [#singleRowFilter] approach would still result in perf benefit. h2. *Q7.
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Description: h2. *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.* On the lines of DPP which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended over to the case where joining key is not a partition column. The Keys of BroadcastHashJoin are already available before actual evaluation of the stream iterator. These keys can be pushed down to the DataSource as a SortedSet. For non partition columns, the DataSources like iceberg have max/min stats on column available at manifest level, and for formats like parquet , they have max/min stats at various storage level. The passed SortedSet can be used to prune using ranges at both driver level ( manifests files) as well as executor level ( while actually going through chunks , row groups etc at parquet level) If the data is stored as Columnar Batch format , then it would not be possible to filter out individual row at DataSource level, even though we have keys. But at the scan level, ( ColumnToRowExec) it is still possible to filter out as many rows as possible , if the query involves nested joins. Thus reducing the number of rows to join at the higher join levels. Will be adding more details.. h2. *Q2. What problem is this proposal NOT designed to solve?* This can only help in BroadcastHashJoin's performance if the join is Inner or Left Semi. This will also not work if there are nodes like Expand, Generator , Aggregate (without group by on keys not part of joining column etc) below the BroadcastHashJoin node being targeted. h2. *Q3. How is it done today, and what are the limits of current practice?* Currently this sort of pruning at DataSource level is being done using DPP (Dynamic Partition Pruning ) and IFF one of the join key column is a Partitioning column ( so that cost of DPP query is justified and way less than amount of data it will be filtering by skipping partitions). The limitation is that DPP type approach is not implemented ( intentionally I believe), if the join column is a non partition column ( because of cost of "DPP type" query would most likely be way high as compared to any possible pruning ( especially if the column is not stored in a sorted manner). h2. *Q4. What is new in your approach and why do you think it will be successful?* 1) This allows pruning on non partition column based joins. 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP type" query. 3) The Data can be used by DataSource to prune at driver (possibly) and also at executor level ( as in case of parquet which has max/min at various structure levels) 4) The big benefit should be seen in multilevel nested join queries. In the current code base, if I am correct, only one join's pruning filter would get pushed at scan level. Since it is on partition key may be that is sufficient. But if it is a nested Join query , and may be involving different columns on streaming side for join, each such filter push could do significant pruning. This requires some handling in case of AQE, as the stream side iterator ( & hence stage evaluation needs to be delayed, till all the available join filters in the nested tree are pushed at their respective target BatchScanExec). 
{:singleRowFilter} 5) In case of nested broadcasted joins, if the datasource is column vector oriented , then what spark would get is a ColumnarBatch. But because scans have Filters from multiple joins, they can be retrieved and can be applied in code generated at ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastedHashJoins ( whose keys have been pushed) , will be used for join evaluation. The code is already there , will be opening a PR. For non partition table TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing 15% gain. For partition table TPCDS, there is improvement in 4 - 5 queries to the tune of 10% to 37%. h2. *Q5. Who cares? If you are successful, what difference will it make?* If use cases involve multiple joins especially when the join columns are non partitioned, and performance is a criteria, this PR *might* help. h2. *Q6. What are the risks?* Well the changes are extensive. review will be painful ( if it happens). Though code is being tested continuously and adding more tests , with big change, some possibility of bugs is there. But as of now, I think the code is robust. To get the Perf benefit fully, the pushed filters utilization needs to be implemented on the DataSource side too. Have already done it for *iceberg*. But I believe atleast in case of Nested Broadcast Hash Joins, [#singleRowFilter] approach would still result in perf benefit. h2. *Q7. How
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Description:
h2. *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.*
Along the lines of DPP, which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended to the case where the joining key is not a partition column. The keys of a BroadcastHashJoin are already available before the stream-side iterator is actually evaluated. These keys can be pushed down to the DataSource as a SortedSet. For non-partition columns, DataSources like iceberg have max/min column stats available at the manifest level, and formats like parquet have max/min stats at various storage levels. The pushed SortedSet can be used to prune by range both at the driver level (manifest files) and at the executor level (while actually going through chunks, row groups etc. at the parquet level). If the data is stored in a columnar batch format, it is not possible to filter out individual rows at the DataSource level, even though the keys are available. But at the scan level (ColumnToRowExec) it is still possible to filter out as many rows as possible if the query involves nested joins, thus reducing the number of rows to join at the higher join levels. Will be adding more details.
h2. *Q2. What problem is this proposal NOT designed to solve?*
This can only help BroadcastHashJoin performance if the join is Inner or Left Semi. It will also not work if there are nodes like Expand, Generator or Aggregate (without a group-by on the joining columns, etc.) below the targeted BroadcastHashJoin node.
h2. *Q3. How is it done today, and what are the limits of current practice?*
Currently this sort of pruning at the DataSource level is done using DPP (Dynamic Partition Pruning), and only if one of the join key columns is a partitioning column (so that the cost of the DPP query is justified and far less than the amount of data it filters out by skipping partitions). The limitation is that the DPP-style approach is not implemented (intentionally, I believe) when the join column is a non-partition column, because the cost of a "DPP type" query would most likely be much higher than the benefit of any possible pruning (especially if the column is not stored in sorted order).
h2. *Q4. What is new in your approach and why do you think it will be successful?*
1) It allows pruning for joins on non-partition columns.
2) Because it piggybacks on the broadcasted keys, there is no extra cost of a "DPP type" query.
3) The keys can be used by the DataSource to prune at the driver (possibly) and also at the executor level (as in parquet, which has max/min stats at various structure levels).
4) The big benefit should be seen in multi-level nested join queries. In the current code base, if I am correct, only one join's pruning filter gets pushed to the scan level; since it is on the partition key, maybe that is sufficient. But in a nested join query, possibly involving different stream-side columns for each join, each such pushed filter could do significant pruning. This requires some handling in the case of AQE, as the stream-side iterator (and hence stage evaluation) needs to be delayed until all the available join filters in the nested tree are pushed to their respective target BatchScanExec nodes.
{anchor:singleRowFilter} 5) In the case of nested broadcast joins, if the DataSource is column-vector oriented, what Spark gets is a ColumnarBatch.
But because the scans carry filters from multiple joins, those filters can be retrieved and applied in the code generated at the ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastHashJoins whose keys have been pushed are used for join evaluation. The code is already there; will be opening a PR. For a non-partitioned-table TPCDS run on a laptop with TPCDS data at scale factor 4, I am seeing a 15% gain. For partitioned-table TPCDS, 4 - 5 queries improve to the tune of 10% to 37%.
h2. *Q5. Who cares? If you are successful, what difference will it make?*
If use cases involve multiple joins, especially when the join columns are not partitioned, and performance is a criterion, this PR *might* help.
h2. *Q6. What are the risks?*
The changes are extensive, so review will be painful (if it happens). Though the code is being tested continuously and more tests are being added, with a change this big some possibility of bugs remains; as of now, I think the code is robust. To get the performance benefit fully, utilization of the pushed filters also needs to be implemented on the DataSource side. That has already been done for *iceberg*. But I believe that at least in the case of nested Broadcast Hash Joins, the [#singleRowFilter] approach would still result in a performance benefit.
h2. *Q7. How
[jira] [Created] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
Asif created SPARK-44662: Summary: SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns Key: SPARK-44662 URL: https://issues.apache.org/jira/browse/SPARK-44662 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.3 Reporter: Asif Fix For: 3.3.3
h2. *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.*
Along the lines of DPP, which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended to the case where the joining key is not a partition column. The keys of a BroadcastHashJoin are already available before the stream-side iterator is actually evaluated. These keys can be pushed down to the DataSource as a SortedSet. For non-partition columns, DataSources like iceberg have max/min column stats available at the manifest level, and formats like parquet have max/min stats at various storage levels. The pushed SortedSet can be used to prune by range both at the driver level (manifest files) and at the executor level (while actually going through chunks, row groups etc. at the parquet level). If the data is stored in a columnar batch format, it is not possible to filter out individual rows at the DataSource level, even though the keys are available. But at the scan level (ColumnToRowExec) it is still possible to filter out as many rows as possible if the query involves nested joins, thus reducing the number of rows to join at the higher join levels. Will be adding more details.
h2. *Q2. What problem is this proposal NOT designed to solve?*
This can only help BroadcastHashJoin performance if the join is Inner or Left Semi. It will also not work if there are nodes like Expand, Generator or Aggregate (without a group-by on the joining columns, etc.) below the targeted BroadcastHashJoin node.
h2. *Q3. How is it done today, and what are the limits of current practice?*
Currently this sort of pruning at the DataSource level is done using DPP (Dynamic Partition Pruning), and only if one of the join key columns is a partitioning column (so that the cost of the DPP query is justified and far less than the amount of data it filters out by skipping partitions). The limitation is that the DPP-style approach is not implemented (intentionally, I believe) when the join column is a non-partition column, because the cost of a "DPP type" query would most likely be much higher than the benefit of any possible pruning (especially if the column is not stored in sorted order).
h2. *Q4. What is new in your approach and why do you think it will be successful?*
1) It allows pruning for joins on non-partition columns.
2) Because it piggybacks on the broadcasted keys, there is no extra cost of a "DPP type" query.
3) The keys can be used by the DataSource to prune at the driver (possibly) and also at the executor level (as in parquet, which has max/min stats at various structure levels).
4) The big benefit should be seen in multi-level nested join queries. In the current code base, if I am correct, only one join's pruning filter gets pushed to the scan level; since it is on the partition key, maybe that is sufficient. But in a nested join query, possibly involving different stream-side columns for each join, each such pushed filter could do significant pruning.
This requires some handling in the case of AQE, as the stream-side iterator (and hence stage evaluation) needs to be delayed until all the available join filters in the nested tree are pushed to their respective target BatchScanExec nodes.
{anchor:singleRowFilter} 5) In the case of nested broadcast joins, if the DataSource is column-vector oriented, what Spark gets is a ColumnarBatch. But because the scans carry filters from multiple joins, those filters can be retrieved and applied in the code generated at the ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastHashJoins whose keys have been pushed are used for join evaluation. The code is already there; will be opening a PR. For a non-partitioned-table TPCDS run on a laptop with TPCDS data at scale factor 4, I am seeing a 15% gain. For partitioned-table TPCDS, 4 - 5 queries improve to the tune of 10% to 37%.
h2. *Q5. Who cares? If you are successful, what difference will it make?*
If use cases involve multiple joins, especially when the join columns are not partitioned, and performance is a criterion, this PR *might* help.
h2. *Q6. What are the risks?*
The changes are extensive, so review will be painful (if it happens). Though the code is being tested continuously and more tests are being added, with a change this big some possibility of bugs remains; as of now, I think the code is robust. To get the Perf benefit
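A minimal sketch of the row-level filtering described in point 5 above (the [#singleRowFilter] idea): rows decoded from a columnar batch are kept only if every pushed join's key set contains their key. The KeyLookup trait and the surrounding types are assumptions made for illustration; they stand in for the proposed containsKey-style check on HashedRelation and are not the actual Spark classes.
{code:scala}
// Stand-in for the proposed containsKey-style lookup against a broadcasted build side.
trait KeyLookup { def containsKey(key: Long): Boolean }

final class SetLookup(keys: Set[Long]) extends KeyLookup {
  override def containsKey(key: Long): Boolean = keys.contains(key)
}

object ColumnToRowFilterSketch {
  // One (column ordinal -> lookup) pair per BroadcastHashJoin whose keys were pushed to this scan.
  def filterRows(rows: Iterator[Array[Long]], pushed: Seq[(Int, KeyLookup)]): Iterator[Array[Long]] =
    rows.filter(row => pushed.forall { case (ordinal, lookup) => lookup.containsKey(row(ordinal)) })

  def main(args: Array[String]): Unit = {
    val rows = Iterator(Array(1L, 10L), Array(2L, 20L), Array(3L, 30L))
    val pushed = Seq(
      0 -> new SetLookup(Set(1L, 3L)),    // keys of the first join, on column 0
      1 -> new SetLookup(Set(10L, 30L)))  // keys of the second join, on column 1
    // only rows satisfying *all* pushed joins survive: (1,10) and (3,30)
    filterRows(rows, pushed).foreach(r => println(r.mkString(",")))
  }
}
{code}
In the actual proposal this predicate would be emitted by whole-stage codegen at ColumnToRowExec, so the filtering happens while rows are materialized from the batch rather than afterwards.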
[jira] [Resolved] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
[ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif resolved SPARK-43112. -- Resolution: Not A Bug > Spark may use a column other than the actual specified partitioning column > for partitioning, for Hive format tables > > > Key: SPARK-43112 > URL: https://issues.apache.org/jira/browse/SPARK-43112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Major > > The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its > output method implemented as > // The partition column should always appear after data columns. > override def output: Seq[AttributeReference] = dataCols ++ partitionCols > But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect > that the output from HiveTableRelation is in the order in which the columns > are actually defined in the DDL. > As a result, multiple mismatch scenarios can happen like: > 1) data type casting exception being thrown , even though the data frame > being inserted has schema which is identical to what is used for creating ddl. > OR > 2) Wrong column being used for partitioning , if the datatypes are same or > cast-able, like date type and long > will be creating a PR with the bug test -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
[ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711607#comment-17711607 ] Asif commented on SPARK-43112: -- Open a WIP PR [SPARK-43112|https://github.com/apache/spark/pull/40765/] which has bug tests as of now > Spark may use a column other than the actual specified partitioning column > for partitioning, for Hive format tables > > > Key: SPARK-43112 > URL: https://issues.apache.org/jira/browse/SPARK-43112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Critical > > The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its > output method implemented as > // The partition column should always appear after data columns. > override def output: Seq[AttributeReference] = dataCols ++ partitionCols > But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect > that the output from HiveTableRelation is in the order in which the columns > are actually defined in the DDL. > As a result, multiple mismatch scenarios can happen like: > 1) data type casting exception being thrown , even though the data frame > being inserted has schema which is identical to what is used for creating ddl. > OR > 2) Wrong column being used for partitioning , if the datatypes are same or > cast-able, like date type and long > will be creating a PR with the bug test -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
[ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-43112: - Description: The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as // The partition column should always appear after data columns. override def output: Seq[AttributeReference] = dataCols ++ partitionCols But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect that the output from HiveTableRelation is in the order in which the columns are actually defined in the DDL. As a result, multiple mismatch scenarios can happen like: 1) data type casting exception being thrown , even though the data frame being inserted has schema which is identical to what is used for creating ddl. OR 2) Wrong column being used for partitioning , if the datatypes are same or cast-able, like date type and long will be creating a PR with the bug test was: The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as // The partition column should always appear after data columns. override def output: Seq[AttributeReference] = dataCols ++ partitionCols But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect that the out from HiveTableRelation is in the order in which the columns are actually defined in the DDL. As a result, multiple mistmatch scenarios can happen like: 1) data type casting exception being thrown , even though the data frame being inserted has schema which is identical to what is used for creating ddl. OR 2) Wrong column being used for partitioning , if the datatypes are same or castable, like datetype and long will be creating a PR with the bug test > Spark may use a column other than the actual specified partitioning column > for partitioning, for Hive format tables > > > Key: SPARK-43112 > URL: https://issues.apache.org/jira/browse/SPARK-43112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Critical > > The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its > output method implemented as > // The partition column should always appear after data columns. > override def output: Seq[AttributeReference] = dataCols ++ partitionCols > But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect > that the output from HiveTableRelation is in the order in which the columns > are actually defined in the DDL. > As a result, multiple mismatch scenarios can happen like: > 1) data type casting exception being thrown , even though the data frame > being inserted has schema which is identical to what is used for creating ddl. > OR > 2) Wrong column being used for partitioning , if the datatypes are same or > cast-able, like date type and long > will be creating a PR with the bug test -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
Asif created SPARK-43112: Summary: Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables Key: SPARK-43112 URL: https://issues.apache.org/jira/browse/SPARK-43112 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.1 Reporter: Asif The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as // The partition column should always appear after data columns. override def output: Seq[AttributeReference] = dataCols ++ partitionCols But the data-writing commands of Spark, like InsertIntoHiveDirCommand, expect that the output from HiveTableRelation is in the order in which the columns are actually defined in the DDL. As a result, multiple mismatch scenarios can happen, such as: 1) a data type casting exception being thrown even though the data frame being inserted has a schema identical to the one used in the DDL, or 2) the wrong column being used for partitioning if the data types are the same or cast-able, like date type and long. Will be creating a PR with the bug test -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
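A tiny sketch of the ordering mismatch the issue describes, using plain Scala stand-ins rather than Spark's catalog classes (the Col type and the example table layout are assumptions made for illustration):
{code:scala}
case class Col(name: String, dataType: String)

object HivePartitionOrderSketch {
  def main(args: Array[String]): Unit = {
    // Suppose the table DDL declares columns in the order (id, eventDate, payload),
    // with eventDate being the partition column.
    val ddlOrder      = Seq(Col("id", "int"), Col("eventDate", "date"), Col("payload", "string"))
    val dataCols      = ddlOrder.filterNot(_.name == "eventDate")
    val partitionCols = ddlOrder.filter(_.name == "eventDate")

    // What HiveTableRelation.output produces: data columns first, partition columns last.
    val relationOutput = dataCols ++ partitionCols
    println(relationOutput.map(_.name))   // List(id, payload, eventDate)
    println(ddlOrder.map(_.name))         // List(id, eventDate, payload)

    // A writer that assumes DDL order while consuming relationOutput positionally would feed
    // "payload" values into the "eventDate" slot: either a cast error, or (if the types happen
    // to be cast-able) silently partitioning by the wrong column.
  }
}
{code}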
[jira] [Commented] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
[ https://issues.apache.org/jira/browse/SPARK-41141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635989#comment-17635989 ] Asif commented on SPARK-41141: -- Opened the following PR [SPARK-41141-PR|https://github.com/apache/spark/pull/38714/files] > avoid introducing a new aggregate expression in the analysis phase when > subquery is referencing it > -- > > Key: SPARK-41141 > URL: https://issues.apache.org/jira/browse/SPARK-41141 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Minor > Labels: spark-sql > > Currently the analyzer phase rules on subquery referencing the aggregate > expression in outer query, avoids introducing a new aggregate only for a > single level aggregate function. It introduces new aggregate expression for > nested aggregate functions. > It is possible to avoid adding this extra aggregate expression easily, > atleast if the outer projection involving aggregate function is exactly same > as the one that is used in subquery, or if the outer query's projection > involving aggregate function is a subtree of the subquery's expression. > > Thus consider the following 2 cases: > 1) select cos (sum(a)) , b from t1 group by b having exists (select x from > t2 where y = cos(sum(a)) ) > 2) select sum(a) , b from t1 group by b having exists (select x from t2 > where y = cos(sum(a)) ) > > In both the above cases, there is no need for adding extra aggregate > expression. > > I am also investigating if its possible to avoid if the case is > > 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from > t2 where y = sum(a) ) > > This Jira also is needed for another issue where subquery datasource v2 is > projecting columns which are not needed. ( no Jira filed yet for that, will > do that..) > > Will be opening a PR for this soon.. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
[ https://issues.apache.org/jira/browse/SPARK-41141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-41141: - Priority: Minor (was: Major) > avoid introducing a new aggregate expression in the analysis phase when > subquery is referencing it > -- > > Key: SPARK-41141 > URL: https://issues.apache.org/jira/browse/SPARK-41141 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Minor > Labels: spark-sql > > Currently the analyzer phase rules on subquery referencing the aggregate > expression in outer query, avoids introducing a new aggregate only for a > single level aggregate function. It introduces new aggregate expression for > nested aggregate functions. > It is possible to avoid adding this extra aggregate expression easily, > atleast if the outer projection involving aggregate function is exactly same > as the one that is used in subquery, or if the outer query's projection > involving aggregate function is a subtree of the subquery's expression. > > Thus consider the following 2 cases: > 1) select cos (sum(a)) , b from t1 group by b having exists (select x from > t2 where y = cos(sum(a)) ) > 2) select sum(a) , b from t1 group by b having exists (select x from t2 > where y = cos(sum(a)) ) > > In both the above cases, there is no need for adding extra aggregate > expression. > > I am also investigating if its possible to avoid if the case is > > 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from > t2 where y = sum(a) ) > > This Jira also is needed for another issue where subquery datasource v2 is > projecting columns which are not needed. ( no Jira filed yet for that, will > do that..) > > Will be opening a PR for this soon.. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
[ https://issues.apache.org/jira/browse/SPARK-41141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-41141: - Description: Currently the analyzer phase rules on subquery referencing the aggregate expression in outer query, avoids introducing a new aggregate only for a single level aggregate function. It introduces new aggregate expression for nested aggregate functions. It is possible to avoid adding this extra aggregate expression easily, atleast if the outer projection involving aggregate function is exactly same as the one that is used in subquery, or if the outer query's projection involving aggregate function is a subtree of the subquery's expression. Thus consider the following 2 cases: 1) select cos (sum(a)) , b from t1 group by b having exists (select x from t2 where y = cos(sum(a)) ) 2) select sum(a) , b from t1 group by b having exists (select x from t2 where y = cos(sum(a)) ) In both the above cases, there is no need for adding extra aggregate expression. I am also investigating if its possible to avoid if the case is 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from t2 where y = sum(a) ) This Jira also is needed for another issue where subquery datasource v2 is projecting columns which are not needed. ( no Jira filed yet for that, will do that..) Will be opening a PR for this soon.. was: Currently the analyzer phase rules on subquery referencing the aggregate expression in outer query, avoids introducing a new aggregate only for a single level aggregate function. It introduces new aggregate expression for nested aggregate functions. It is possible to avoid adding this extra aggregate expression easily, atleast if the outer projection involving aggregate function is exactly same as the one that is used in subquery, or if the outer query's projection involving aggregate function is a subtree of the subquery's expression. Thus consider the following 2 cases: 1) select cos (sum(a)) , b from t1 group by b having exists (select x from t2 where y = cos(sum(a)) ) 2) select sum(a) , b from t1 group by b having exists (select x from t2 where y = cos(sum(a)) ) In both the above cases, there is no need for adding extra aggregate expression. I am also investigating if its possible to avoid if the case is 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from t2 where y = sum(a) ) This Jira also is needed for another issue where subquery datasource v2 is projecting columns which are not needed. ( no Jira filed yet for that, will do that..) > avoid introducing a new aggregate expression in the analysis phase when > subquery is referencing it > -- > > Key: SPARK-41141 > URL: https://issues.apache.org/jira/browse/SPARK-41141 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Major > Labels: spark-sql > > Currently the analyzer phase rules on subquery referencing the aggregate > expression in outer query, avoids introducing a new aggregate only for a > single level aggregate function. It introduces new aggregate expression for > nested aggregate functions. > It is possible to avoid adding this extra aggregate expression easily, > atleast if the outer projection involving aggregate function is exactly same > as the one that is used in subquery, or if the outer query's projection > involving aggregate function is a subtree of the subquery's expression. 
> > Thus consider the following 2 cases: > 1) select cos (sum(a)) , b from t1 group by b having exists (select x from > t2 where y = cos(sum(a)) ) > 2) select sum(a) , b from t1 group by b having exists (select x from t2 > where y = cos(sum(a)) ) > > In both the above cases, there is no need for adding extra aggregate > expression. > > I am also investigating if its possible to avoid if the case is > > 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from > t2 where y = sum(a) ) > > This Jira also is needed for another issue where subquery datasource v2 is > projecting columns which are not needed. ( no Jira filed yet for that, will > do that..) > > Will be opening a PR for this soon.. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
Asif created SPARK-41141: Summary: avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it Key: SPARK-41141 URL: https://issues.apache.org/jira/browse/SPARK-41141 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1 Reporter: Asif Currently the analyzer-phase rules for a subquery referencing an aggregate expression in the outer query avoid introducing a new aggregate expression only for a single-level aggregate function; they introduce a new aggregate expression for nested aggregate functions. It is possible to avoid adding this extra aggregate expression, at least when the outer projection involving the aggregate function is exactly the same as the one used in the subquery, or when the outer query's projection involving the aggregate function is a subtree of the subquery's expression. Thus consider the following 2 cases: 1) select cos (sum(a)) , b from t1 group by b having exists (select x from t2 where y = cos(sum(a)) ) 2) select sum(a) , b from t1 group by b having exists (select x from t2 where y = cos(sum(a)) ) In both of the above cases, there is no need to add an extra aggregate expression. I am also investigating whether it is possible to avoid it for the case 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from t2 where y = sum(a) ) This Jira is also needed for another issue where a subquery over DataSource V2 projects columns which are not needed. (No Jira filed yet for that; will do that.) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
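A compact sketch of the reuse condition argued for above, i.e. "the outer aggregate expression equals, or is a subtree of, the subquery's expression". The miniature expression ADT and the isSubtree helper are assumptions made for illustration; they are not the analyzer's actual representation.
{code:scala}
object AggregateReuseSketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Sum(child: Expr)   extends Expr
  case class Cos(child: Expr)   extends Expr

  // true if `outer` appears as a subtree of `sub` (including being equal to it)
  def isSubtree(outer: Expr, sub: Expr): Boolean =
    outer == sub || (sub match {
      case Sum(c) => isSubtree(outer, c)
      case Cos(c) => isSubtree(outer, c)
      case _      => false
    })

  def main(args: Array[String]): Unit = {
    val sumA = Sum(Attr("a"))
    // case 1: outer projects cos(sum(a)), subquery compares against cos(sum(a)) -> reusable
    println(isSubtree(Cos(sumA), Cos(sumA)))   // true
    // case 2: outer projects sum(a), subquery uses cos(sum(a)) -> sum(a) is a subtree -> reusable
    println(isSubtree(sumA, Cos(sumA)))        // true
    // case 3: outer projects cos(sum(a)), subquery uses sum(a) -> not covered by this check
    println(isSubtree(Cos(sumA), sumA))        // false
  }
}
{code}
Cases 1 and 2 correspond to the two queries above where no extra aggregate expression is needed; case 3 is the situation still under investigation.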
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606817#comment-17606817 ] Asif commented on SPARK-33152: -- Added a test *CompareNewAndOldConstraintsSuite* in the PR which when run on master will highlight functionality issues with master as well as perf issue. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1, 3.1.2 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases( with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue if not fixed can cause OutOfMemory issue or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by the virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on the working of some of the other unrelated previous optimizer rules is > behaving, is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in existing ConstraintPropagationSuite which is > missing a IsNotNull constraints because the code incorrectly generated a > EqualsNullSafeConstraint instead of EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluation all the > constraints can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. What problem is this proposal NOT designed to solve? 
> It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed/ does not > happen , respectively, by the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code, pessimistically tries to generates all > the possible combinations of constraints , based on the aliases ( even then > it may miss a lot of combinations if the expression is a complex expression > involving same attribute repeated multiple times within the expression and > there are many aliases to that column). There are query plans in our > production env, which can result in intermediate number of constraints going > into hundreds of thousands, causing OOM or taking time running into hours. > Also there are cases where it incorrectly generates an EqualNullSafe > constraint instead of EqualTo constraint , thus missing a possible IsNull > constraint on column. > Also it only pushes single column predicate on the other side of the join. > The constraints generated , in
[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-33152: - Shepherd: Wenchen Fan (was: Arnaud Doucet) Description:
h2. Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.
Proposing a new algorithm to create, store and use constraints for removing redundant filters and inferring new filters. The current algorithm has subpar performance in complex expression scenarios involving aliases (with certain use cases the compilation time can go into hours), has the potential to cause OOM, may miss removing redundant filters in different scenarios, may miss creating IsNotNull constraints in different scenarios, and does not push compound predicates into Join.
# If not fixed, this issue can cause OutOfMemory errors or unacceptable query compilation times. Have added a test "plan equivalence with case statements and performance comparison with benefit of more than 10x conservatively" in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. *With this PR the compilation time is 247 ms vs 13958 ms without the change*
# It is more effective in filter pruning, as is evident in some of the tests in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, where the current code is not able to identify the redundant filter in some cases.
# It is able to generate a better optimized plan for join queries, as it can push compound predicates.
# The current logic can miss many possible cases of removing redundant predicates, as it fails to take into account whether the same attribute or its aliases are repeated multiple times in a complex expression.
# There are cases where some of the optimizer rules involving removal of redundant predicates fail to remove them on the basis of constraint data. In some cases the rule works only by virtue of previous rules covering the inaccuracy. That the ConstraintPropagation rule, and its function of removing redundant filters and adding newly inferred filters, depends on the behaviour of other, unrelated, earlier optimizer rules is itself indicative of issues.
# It does away with all the EqualNullSafe constraints, as this logic does not need those constraints to be created.
# There is at least one test in the existing ConstraintPropagationSuite which is missing an IsNotNull constraint because the code incorrectly generated an EqualNullSafe constraint instead of an EqualTo constraint when using the existing Constraints code. With these changes, the test correctly creates an EqualTo constraint, resulting in an inferred IsNotNull constraint.
# It does away with the current combinatorial logic of evaluating all the constraints, which can cause compilation to run into hours or cause OOM. The number of constraints stored is exactly the same as the number of filters encountered.
h2. Q2. What problem is this proposal NOT designed to solve?
It mainly focuses on compile-time performance, but in some cases it can benefit run-time characteristics too, such as inferring an IsNotNull filter or pushing down compound predicates on the join, which the present code may miss or does not do, respectively.
h2. Q3. How is it done today, and what are the limits of current practice?
Current ConstraintsPropagation code pessimistically tries to generate all the possible combinations of constraints based on the aliases (even then it may miss a lot of combinations if the expression is a complex expression involving the same attribute repeated multiple times, and there are many aliases to that column). There are query plans in our production environment which can result in the intermediate number of constraints going into hundreds of thousands, causing OOM or taking time running into hours. There are also cases where it incorrectly generates an EqualNullSafe constraint instead of an EqualTo constraint, thus missing a possible IsNotNull constraint on a column. Also, it only pushes a single-column predicate to the other side of the join. The constraints generated are, in some cases, missing the required ones, and the plan apparently behaves correctly only due to a preceding, unrelated optimizer rule. Have a test which shows that with the bare minimum rules containing RemoveRedundantPredicate, it misses the removal of a redundant predicate.
h2. Q4. What is new in your approach and why do you think it will be successful?
It solves all the above-mentioned issues.
# The number of constraints created is the same as the number of filters. No combinatorial creation of constraints. No need for an EqualNullSafe constraint on aliases.
# Can remove redundant predicates on any expression involving aliases, irrespective of the number of repeat occurrences, in all possible combinations.
# Brings down query compilation time to a few minutes from hours.
# Can push compound predicates on
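A back-of-the-envelope sketch of the combinatorial blow-up described in Q3, using made-up alias counts. The formula is an illustrative simplification of "one rewritten constraint per alias substitution", not the exact behaviour of the existing ConstraintsPropagation code.
{code:scala}
object ConstraintExplosionSketch {
  // If a constraint references an attribute `occurrences` times and the attribute has `aliases`
  // aliases, a pessimistic expansion that tries every substitution produces
  // (aliases + 1) ^ occurrences variants for that attribute alone.
  def variantsPerAttribute(aliases: Int, occurrences: Int): Long =
    math.pow(aliases + 1, occurrences).toLong

  def main(args: Array[String]): Unit = {
    // e.g. a CASE-heavy predicate mentioning attribute `a` 4 times, with 9 aliases of `a`
    val perAttr = variantsPerAttribute(aliases = 9, occurrences = 4)  // 10^4 = 10000
    // a second attribute with the same shape multiplies the count
    println(perAttr * perAttr)                                        // 100000000 candidate rewrites
    // the proposed scheme instead stores one canonical constraint per filter plus an alias
    // mapping, i.e. O(number of filters)
  }
}
{code}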
[jira] [Updated] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
[ https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-40362: - Description: In the canonicalization code, which is now in two stages, canonicalization involving commutative operators is broken if they are subexpressions of certain types of expressions which override precanonicalize, for example BinaryComparison. Consider the following expression: a + b > 10, i.e. GreaterThan(Add(a, b), Literal(10)). In precanonicalize, the BinaryComparison operator first precanonicalizes its children and may then swap the operands based on the left/right hashCode inequality. Let's say Add(a, b).hashCode is greater than Literal(10).hashCode; as a result GT is converted to LT. But if the same tree is created as GreaterThan(Add(b, a), Literal(10)), the hashCode of Add(b, a) is not the same as that of Add(a, b), so it is possible that for this tree Add(b, a).hashCode is less than Literal(10).hashCode, in which case GT remains as is. Thus two similar trees result in different canonicalizations, one having GT, the other having LT. The problem occurs because canonicalization normalizes a commutative expression to a consistent hashCode, which is not the case with precanonicalize: the hashCodes of a commutative expression's precanonicalized and canonicalized forms are different.
The test
{quote}test("bug X") {
val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
val y = tr1.where('a.attr + 'c.attr > 10).analyze
val fullCond = y.asInstanceOf[Filter].condition.clone()
val addExpr = (fullCond match { case GreaterThan(x, _) => x }).clone().asInstanceOf[Add]
val canonicalizedFullCond = fullCond.canonicalized
// swap the operands of add
val newAddExpr = Add(addExpr.right, addExpr.left)
// build a new condition which is same as the previous one, but with operands of
// Add reversed
val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}
This test fails. The fix which I propose is that for commutative expressions, precanonicalize should be overridden and Canonicalize.reorderCommutativeOperators be invoked on the expression there instead of in canonicalize; effectively, for commutative operators (Add, Or, Multiply, And etc.) canonicalize and precanonicalize should be the same. PR: [https://github.com/apache/spark/pull/37824] I am also trying a better fix, whereby the idea is that for commutative expressions the murmur hashCodes are calculated using unorderedHash so that they are order independent (i.e. symmetric). The above approach works fine, but in the case of Least & Greatest, the Product's element is a Seq, and that messes with the consistency of the hashCode. was: In the canonicalization code which is now in two stages, canonicalization involving Commutative operators is broken, if they are subexpressions of certain type of expressions which override precanonicalize. Consider following expression: a + b > 10 This GreaterThan expression when canonicalized as a whole for first time, will skip the call to Canonicalize.reorderCommutativeOperators for the Add expression as the GreaterThan's canonicalization used precanonicalize on children ( the Add expression). so if create a new expression b + a > 10 and invoke canonicalize it, the canonicalized versions of these two expressions will not match.
The test
{quote}test("bug X") {
val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
val y = tr1.where('a.attr + 'c.attr > 10).analyze
val fullCond = y.asInstanceOf[Filter].condition.clone()
val addExpr = (fullCond match { case GreaterThan(x, _) => x }).clone().asInstanceOf[Add]
val canonicalizedFullCond = fullCond.canonicalized
// swap the operands of add
val newAddExpr = Add(addExpr.right, addExpr.left)
// build a new condition which is same as the previous one, but with operands of
// Add reversed
val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}
This test fails. The fix which I propose is that for the commutative expressions, the precanonicalize should be overridden and Canonicalize.reorderCommutativeOperators be invoked on the expression instead of at place of canonicalize. effectively for commutative operands ( add, or , multiply , and etc) canonicalize and precanonicalize should be same. PR: https://github.com/apache/spark/pull/37824 > Bug in Canonicalization of expressions like Add & Multiply i.e Commutative > Operators > > > Key: SPARK-40362 > URL: https://issues.apache.org/jira/browse/SPARK-40362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >
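A minimal sketch of the order-independent hashing idea mentioned in the updated description above (computing a commutative node's hash from an unordered combination of its children's hashes). The SimpleExpr hierarchy is an assumption made for illustration and is unrelated to Spark's Expression or Murmur3 code.
{code:scala}
object CommutativeHashSketch {
  sealed trait SimpleExpr { def semanticHash: Int }

  case class Leaf(name: String) extends SimpleExpr {
    override def semanticHash: Int = name.hashCode
  }

  // A commutative node combines child hashes with an order-insensitive reduction (a sum here),
  // so Add(a, b) and Add(b, a) hash identically and operand swapping during canonicalization
  // produces a consistent result regardless of the original operand order.
  case class AddExpr(children: Seq[SimpleExpr]) extends SimpleExpr {
    override def semanticHash: Int = "Add".hashCode ^ children.map(_.semanticHash).sum
  }

  def main(args: Array[String]): Unit = {
    val ab = AddExpr(Seq(Leaf("a"), Leaf("b")))
    val ba = AddExpr(Seq(Leaf("b"), Leaf("a")))
    println(ab.semanticHash == ba.semanticHash)   // true: operand order no longer matters
  }
}
{code}
As the description notes, the wrinkle for expressions like Least and Greatest is that their children live in a Seq inside the case class Product, so an ordered hash over the Product would still be order sensitive; an unordered reduction over the children sidesteps that.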