Thanks for the reply! To clarify on issue 2: a query can still be broken into multiple jobs without AQE; I had turned AQE off in my posted example (the exact setting is shown below).
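For completeness, here is a minimal sketch of how I disable it, assuming Spark 3.x, a `SparkSession` named `spark`, and `q1` holding the SQL text of the query from my example:

```scala
// Turn off adaptive query execution before running the query, so the
// physical plan (and the RDD DAG derived from it) is fixed up front.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.sql(q1).collect()
```

Even with this setting, the query still runs as multiple jobs and regenerates some RDDs, which is what issue 2 is about.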
For 1, an end user would just turn a knob on or off to use stage-level scheduling for Spark SQL; I am considering adding a component between the Spark SQL module and the Spark Core module to optimize the stage-level resources. Yes, SQL is declarative. Spark SQL uses a sequence of components (such as the logical planner, the physical planner, and the CBO) to produce a selected physical plan, and the RDDs (with their transformations) are generated from that plan for execution. For now, we can only get the top-level RDD of the DAG of RDDs via `spark.sql(q1).queryExecution.toRdd`, which is not enough to make stage-level scheduling decisions, because the stage-level resources are profiled per RDD. If all the RDDs were exposed instead of just the top-level one, it seems possible to apply stage-level scheduling here (a concrete sketch is at the end of this message).

P.S. Let me attach the link about the RDD regeneration explicitly, in case it is not shown on the mailing-list website: https://stackoverflow.com/questions/73895506/how-to-avoid-rdd-regeneration-in-spark-sql

Cheers,
Chenghao

On Sep 29, 2022, 5:22 PM +0200, Herman van Hovell <her...@databricks.com>, wrote:
> I think issue 2 is caused by adaptive query execution. This will break apart
> queries into multiple jobs; each subsequent job will generate an RDD that is
> based on the previous ones.
>
> As for 1, I am not sure how much you want to expose to an end user here. SQL
> is declarative, and it does not specify how a query should be executed. I can
> imagine that you might use different resources for different types of stages,
> e.g. a scan stage and more compute-heavy stages. This, IMO, should be based
> on analyzing and costing the plan. For this, RDD-only stage-level scheduling
> should be sufficient.
>
> On Thu, Sep 29, 2022 at 8:56 AM Chenghao Lyu <cheng...@cs.umass.edu> wrote:
> > > Hi,
> > >
> > > I plan to deploy stage-level scheduling for Spark SQL to apply some
> > > fine-grained optimizations over the DAG of stages. However, I am blocked
> > > by the following issues:
> > >
> > > 1. The current stage-level scheduling supports RDD APIs only, so is there
> > > a way to reuse it for Spark SQL? E.g., how can we expose the RDD code
> > > (the transformations and actions) from a Spark SQL query (written in SQL
> > > syntax)?
> > > 2. We do not quite understand why a Spark SQL query can trigger multiple
> > > jobs and have some RDDs regenerated, as posted here. Can anyone give us
> > > some insight into the reasons, and whether we can avoid the RDD
> > > regeneration to save execution time?
> > >
> > > Thanks in advance.
> > >
> > > Cheers,
> > > Chenghao
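P.S. 2: here is the sketch I referred to above for issue 1. It is only an illustration of what I would like to do, not working end-to-end code: it assumes Spark 3.1+ (for `RDD.withResources`), a `SparkSession` named `spark`, dynamic allocation being enabled, and made-up resource numbers. It walks the lineage reachable from `queryExecution.toRdd` through the public `dependencies` API; note that this still misses the RDDs of separately submitted jobs (e.g., for subqueries), which is exactly why exposing all the RDDs would help.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Only the final RDD of the physical plan is exposed today.
val topRdd: RDD[_] = spark.sql(q1).queryExecution.toRdd

// Collect every RDD reachable from the final RDD via its dependencies.
def collectLineage(root: RDD[_]): Seq[RDD[_]] = {
  val seen = mutable.LinkedHashSet[RDD[_]]()
  def visit(r: RDD[_]): Unit =
    if (seen.add(r)) r.dependencies.foreach(d => visit(d.rdd))
  visit(root)
  seen.toSeq
}
collectLineage(topRdd).foreach(r => println(s"RDD ${r.id}: ${r.getClass.getSimpleName}"))

// Stage-level scheduling today: attach a ResourceProfile to an RDD before an
// action runs (needs dynamic allocation on YARN/K8s). The numbers are made up.
val execReqs = new ExecutorResourceRequests().cores(4).memory("8g")
val taskReqs = new TaskResourceRequests().cpus(1)
val profile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()
topRdd.withResources(profile)
```

If all the RDDs of the plan were exposed, the `withResources` call could be made per stage boundary instead of only on the top-level RDD.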