Hi, I plan to use stage-level scheduling with Spark SQL to apply fine-grained optimizations over the DAG of stages. However, I am blocked by the following issues:
1. The current stage-level scheduling supports the RDD API only. Is there a way to reuse stage-level scheduling for Spark SQL? For example, how can we get at the underlying RDD code (the transformations and actions) of a Spark SQL query written in SQL syntax? (A rough sketch of what I have in mind is in the P.S. below.)

2. We do not quite understand why a single Spark SQL query can trigger multiple jobs and have some RDDs regenerated, as posted here. Can anyone give us some insight into the reasons, and whether we can avoid the RDD regeneration to save execution time?

Thanks in advance.

Cheers,
Chenghao
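
P.S. To make question 1 concrete, below is a rough sketch of the kind of thing I am hoping to do: build a ResourceProfile with the RDD-level stage-level scheduling API (Spark 3.1+) and attach it to the RDD obtained from a DataFrame via df.rdd. The table name my_table and the resource amounts are placeholders, and this assumes a cluster manager that supports ResourceProfiles with dynamic allocation enabled.

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
import org.apache.spark.sql.SparkSession

object StageLevelSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stage-level-sql-sketch").getOrCreate()

    // Placeholder query; "my_table" stands in for a real registered table.
    val df = spark.sql("SELECT key, COUNT(*) AS cnt FROM my_table GROUP BY key")

    // Build a ResourceProfile with the RDD-level stage-level scheduling API.
    val execReqs = new ExecutorResourceRequests().cores(4).memory("8g")
    val taskReqs = new TaskResourceRequests().cpus(1)
    val profile  = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

    // Drop from the DataFrame down to its RDD and attach the profile there.
    // df.rdd re-plans the query into an RDD of Row objects, so actions on it
    // run outside the normal SQL execution path (no AQE, etc.).
    // (df.queryExecution.toRdd would give the RDD[InternalRow] instead.)
    val rdd = df.rdd.withResources(profile)
    println(rdd.count())

    spark.stop()
  }
}

My concern with this approach is that dropping down to df.rdd bypasses the SQL planner, so I am not sure it is the right way to reuse stage-level scheduling for SQL workloads.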