Thanks for the reply!

To clarify, for issue 2, a query can still be broken apart into multiple jobs 
even without AQE — I had turned AQE off in my posted example.

For 1, an end user would just need to turn a knob on/off to use stage-level 
scheduling for Spark SQL — I am considering adding a component between the 
Spark SQL module and the Spark Core module to optimize the stage-level resources.
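For reference, this is roughly what the existing RDD-level stage-level 
scheduling API (the `org.apache.spark.resource` classes available since Spark 
3.1) looks like today. The knob plus the new component would essentially pick 
such ResourceProfiles per stage automatically, instead of requiring hand-written 
code like this minimal sketch (the cores/memory numbers here are just 
placeholders):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a ResourceProfile describing the executors/tasks a stage should run with.
val execReqs = new ExecutorResourceRequests().cores(4).memory("8g")
val taskReqs = new TaskResourceRequests().cpus(2)
val profile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

// Attach the profile to an RDD; the stages computing this RDD run with these
// resources (requires dynamic allocation on YARN/K8s/standalone).
val rdd = spark.sparkContext.parallelize(1 to 1000, 8)
rdd.withResources(profile).map(_ * 2).collect()
```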

Yes, SQL is declarative. It uses a sequence of components (such as the logical 
planner, physical planner, and CBO) to select a physical plan. The RDDs (with 
their transformations) are then generated from the selected physical plan for 
execution. For now, we can only get the top-level RDD of the RDD DAG via 
`spark.sql(q1).queryExecution.toRdd`, which is not enough to make stage-level 
scheduling decisions, since the stage-level resources are profiled per RDD. If 
we could expose all the RDDs instead of just the top-level one, it seems 
possible to apply stage-level scheduling here.
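To make this concrete, here is a rough, untested sketch of what we can do today 
by walking the dependencies of the top-level RDD (`q1` stands for the SQL text 
from my example). The issue is that this only covers the RDDs reachable from 
that single lineage and still gives no mapping back to the physical plan nodes:

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Recursively collect every RDD reachable from the final RDD of the query.
def collectRdds(root: RDD[_]): Seq[RDD[_]] = {
  val seen = mutable.LinkedHashSet.empty[RDD[_]]
  def visit(r: RDD[_]): Unit =
    if (seen.add(r)) r.dependencies.foreach(dep => visit(dep.rdd))
  visit(root)
  seen.toSeq
}

val root = spark.sql(q1).queryExecution.toRdd  // RDD[InternalRow]
collectRdds(root).foreach(r => println(s"rdd ${r.id}: ${r.getClass.getSimpleName}"))
```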


P.S. Let me attach the link about the RDD regeneration explicitly in case it is 
not shown on the mailing-list website: 
https://stackoverflow.com/questions/73895506/how-to-avoid-rdd-regeneration-in-spark-sql

Cheers,
Chenghao
On Sep 29, 2022, 5:22 PM +0200, Herman van Hovell <her...@databricks.com>, 
wrote:
> I think issue 2 is caused by adaptive query execution. This will break apart 
> queries into multiple jobs, and each subsequent job will generate an RDD that 
> is based on the previous ones.
>
> As for 1, I am not sure how much you want to expose to an end user here. SQL 
> is declarative, and it does not specify how a query should be executed. I can 
> imagine that you might use different resources for different types of stages, 
> e.g. a scan stage and more compute-heavy stages. This, IMO, should be based 
> on analyzing and costing the plan. For this, RDD-only stage-level scheduling 
> should be sufficient.
>
> > On Thu, Sep 29, 2022 at 8:56 AM Chenghao Lyu <cheng...@cs.umass.edu> wrote:
> > > Hi,
> > >
> > > I plan to deploy the stage-level scheduling for Spark SQL to apply some 
> > > fine-grained optimizations over the DAG of stages. However, I am blocked 
> > > by the following issues:
> > >
> > > 1. The current stage-level scheduling supports RDD APIs only. So is there 
> > > a way to reuse the stage-level scheduling for Spark SQL? E.g., how to 
> > > expose the RDD code (the transformations and actions) from a Spark SQL 
> > > (with SQL syntax)?
> > > 2. We do not quite understand why a Spark SQL query can trigger multiple 
> > > jobs and have some RDDs regenerated, as posted here. Can anyone give 
> > > us some insight into the reasons and whether we can avoid the RDD 
> > > regeneration to save execution time?
> > >
> > > Thanks in advance.
> > >
> > > Cheers,
> > > Chenghao
