I think issue 2 is caused by adaptive query execution (AQE). AQE breaks a
query into multiple jobs, and each subsequent job generates an RDD that is
based on the previous ones.
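
If you want to see whether AQE is the cause, a minimal sketch (assuming
Spark 3.x, where AQE is controlled by spark.sql.adaptive.enabled) is to
turn it off for the session and compare the job breakdown; the app name
below is just a placeholder, and disabling AQE means losing its runtime
re-optimizations:

    import org.apache.spark.sql.SparkSession

    // Sketch only: "aqe-demo" is a placeholder app name.
    // spark.sql.adaptive.enabled toggles adaptive query execution; with it
    // off, Spark plans the query up front instead of splitting it into
    // multiple re-optimized jobs.
    val spark = SparkSession.builder()
      .appName("aqe-demo")
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()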

As for 1, I am not sure how much you want to expose to an end user here.
SQL is declarative, and it does not specify how a query should be executed.
I can imagine that you might want to use different resources for different
types of stages, e.g. a scan stage versus more compute-heavy stages. This,
IMO, should be based on analyzing and costing the plan. For that, the
existing RDD-only stage-level scheduling should be sufficient.
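
For reference, a rough sketch of what the RDD-only stage-level scheduling
API looks like today (the resource numbers and the table name are made up,
spark is assumed to be an existing SparkSession, and stage-level scheduling
generally also requires dynamic allocation on YARN/K8s):

    import org.apache.spark.resource.{ExecutorResourceRequests,
      ResourceProfileBuilder, TaskResourceRequests}

    // Build a resource profile for a compute-heavy stage
    // (the cores/memory/cpus values are illustrative only).
    val execReqs = new ExecutorResourceRequests().cores(4).memory("8g")
    val taskReqs = new TaskResourceRequests().cpus(2)
    val profile = new ResourceProfileBuilder()
      .require(execReqs)
      .require(taskReqs)
      .build()

    // Drop from SQL down to the underlying RDD[Row], attach the profile,
    // and run an action on that RDD. "some_table" is a placeholder.
    val rowRdd = spark.sql("SELECT * FROM some_table").rdd
    val count = rowRdd.withResources(profile).count()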

On Thu, Sep 29, 2022 at 8:56 AM Chenghao Lyu <cheng...@cs.umass.edu> wrote:

> Hi,
>
> I plan to deploy the stage-level scheduling for Spark SQL to apply some
> fine-grained optimizations over the DAG of stages. However, I am blocked by
> the following issues:
>
>    1. The current stage-level scheduling
>    <https://spark.apache.org/docs/latest/configuration.html#stage-level-scheduling-overview>
>    supports RDD APIs only. So is there a way to reuse the stage-level
>    scheduling for Spark SQL? E.g., how can we expose the RDD code (the
>    transformations and actions) from a Spark SQL query (with SQL syntax)?
>    2. We do not quite understand why a Spark SQL query could trigger
>    multiple jobs and have some RDDs regenerated, as posted *here*
>    <https://stackoverflow.com/questions/73895506/how-to-avoid-rdd-regeneration-in-spark-sql>.
>    Can anyone give us some insight on the reasons and whether we can
>    avoid the RDD regeneration to save execution time?
>
> Thanks in advance.
>
> Cheers,
> Chenghao
>
