Hi, I plan to use stage-level scheduling with Spark SQL to apply fine-grained optimizations over the DAG of stages. However, I am blocked by the following issues:
1. The current stage-level scheduling supports the RDD API only. Is there a way to reuse stage-level scheduling for Spark SQL? For example, how can we get at the underlying RDD code (the transformations and actions) of a Spark SQL query written in SQL syntax? (A rough sketch of what I have in mind is in the P.S. below.)

2. We do not quite understand why a single Spark SQL query can trigger multiple jobs and have some RDDs regenerated, as posted here. Can anyone give us some insight into the reasons, and whether we can avoid the RDD regeneration to save execution time?

Thanks in advance.

Cheers,
Chenghao
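
P.S. To make question 1 concrete, below is a rough sketch of the kind of thing I am hoping to do: build a ResourceProfile with the RDD-level stage-level scheduling API (Spark 3.1+) and attach it to the RDD obtained from a DataFrame via df.rdd. The table name my_table and the resource amounts are placeholders, and this assumes a cluster manager that supports ResourceProfiles with dynamic allocation enabled.

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
import org.apache.spark.sql.SparkSession

object StageLevelSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stage-level-sql-sketch").getOrCreate()

    // Placeholder query; "my_table" stands in for a real registered table.
    val df = spark.sql("SELECT key, COUNT(*) AS cnt FROM my_table GROUP BY key")

    // Build a ResourceProfile with the RDD-level stage-level scheduling API.
    val execReqs = new ExecutorResourceRequests().cores(4).memory("8g")
    val taskReqs = new TaskResourceRequests().cpus(1)
    val profile  = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

    // Drop from the DataFrame down to its RDD and attach the profile there.
    // df.rdd re-plans the query into an RDD of Row objects, so actions on it
    // run outside the normal SQL execution path (no AQE, etc.).
    // (df.queryExecution.toRdd would give the RDD[InternalRow] instead.)
    val rdd = df.rdd.withResources(profile)
    println(rdd.count())

    spark.stop()
  }
}

My concern with this approach is that dropping down to df.rdd bypasses the SQL planner, so I am not sure it is the right way to reuse stage-level scheduling for SQL workloads.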