Eren Avsarogullari created SPARK-41214:
------------------------------------------

             Summary: Sub-plan metrics under InMemoryRelation are missed when 
AQE Cached DataFrame Support is enabled
                 Key: SPARK-41214
                 URL: https://issues.apache.org/jira/browse/SPARK-41214
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Eren Avsarogullari
         Attachments: DAG when AQE=ON and AQECachedDFSupport=ON with fix.png, 
DAG when AQE=ON and AQECachedDFSupport=ON without fix.png

*spark.sql.optimizer.canChangeCachedPlanOutputPartitioning* enables AQE 
optimizations under InMemoryRelation (IMR) nodes. The following sample query has 
an IMR node on both BroadcastHashJoin (BHJ) legs. However, 
when spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true, 
the following data is missing because the final sub-plans (under IMR) are not 
submitted to the UI:
{code:java}
- Physical operators' SQLMetrics (before AdaptiveSparkPlan) are missing, such as 
Exchange and HashAggregate on the left BHJ leg and HashAggregate on the right BHJ leg,
- WholeStageCodegen (WSCG) blocks are missing on the left BHJ leg, 
- the AQEShuffleRead node is missing on the left BHJ leg. {code}
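For reference, the two settings involved can also be enabled cluster-wide via 
spark-defaults.conf instead of per-session builder config (a minimal sketch; the 
property names are the same ones used in the repro below):
{code}
# Enable Adaptive Query Execution (default since Spark 3.2)
spark.sql.adaptive.enabled                                  true
# Allow AQE to change the output partitioning of cached plans (IMR sub-plans)
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning   true
{code}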
*Sample to reproduce:*
{code:java}
val spark = SparkSession
    .builder()
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
    .master("local[*]")
    .getOrCreate()

import spark.implicits._

// Create 1st DF
val arr = Seq(
  (1, "Employee_1", "Department_1"),
  (2, "Employee_2", "Department_2"))
val df = arr.toDF("id", "name", "department")
  .filter('id < 3)
  .groupBy('name)
  .count()
df.cache()

// Create 2nd DF
val arr2 = Seq((1, "Employee_1", "Department_1"))
val df2 = arr2.toDF("id", "name", "department")
  .filter('id > 0)
  .groupBy('name)
  .count()
df2.cache()

// Trigger query execution
val df3 = df.join(df2, "name")
df3.show() {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
