Re: Deploying stage-level scheduling for Spark SQL

2022-10-03 Thread Tom Graves
1) In my opinion this is too complex for the average user. In this case I'm assuming you have some sort of optimizer that would apply and do it automatically for the user? If it's just in the research stage of things, can you just modify Spark to do experiments? 2) I think the main thing is

Re: Deploying stage-level scheduling for Spark SQL

2022-09-30 Thread Chenghao Lyu
Thanks for the clarification, Tom! A bit more background on what we want to do: we have proposed a fine-grained (stage-level) resource optimization approach at VLDB 2022 (https://www.vldb.org/pvldb/vol15/p3098-lyu.pdf) and would like to try it on Spark. Our approach can recommend the resource

Re: Deploying stage-level scheduling for Spark SQL

2022-09-30 Thread Tom Graves
See the original SPIP for why we only support RDD: https://issues.apache.org/jira/browse/SPARK-27495 The main problem is exactly what you are referring to. The RDD level is not exposed to the user when using the SQL or DataFrame API. This is on purpose, and the user shouldn't have to know

Re: Deploying stage-level scheduling for Spark SQL

2022-09-30 Thread Chenghao Lyu
Thanks for the reply! To clarify, for issue 2, it could still break apart a query into multiple jobs without AQE; I have turned off AQE in my posted example. For 1, an end user just needs to turn a knob on or off to use stage-level scheduling for Spark SQL; I am considering adding a

Re: Deploying stage-level scheduling for Spark SQL

2022-09-29 Thread Herman van Hovell
I think issue 2 is caused by adaptive query execution. This will break apart queries into multiple jobs; each subsequent job will generate an RDD that is based on the previous ones. As for 1, I am not sure how much you want to expose to an end user here. SQL is declarative, and it does not specify how

Deploying stage-level scheduling for Spark SQL

2022-09-29 Thread Chenghao Lyu
Hi, I plan to deploy the stage-level scheduling for Spark SQL to apply some fine-grained optimizations over the DAG of stages. However, I am blocked by the following issues: 1. The current stage-level scheduling supports RDD APIs only. So is there a way to reuse the stage-level scheduling for