Thank you all for your responses; your points are all valid and the advice is great. As suggested in JIRA ticket CALCITE-1737, the goal of my GSoC proposal would be to support Spark's DataFrame/DataSet API in Apache Calcite. However, this task makes less sense to me at the moment, because the purpose of the DataFrame/DataSet API is to provide a high-level interface and perform query optimization to obtain an efficient physical plan, which is exactly what Calcite with the Spark adapter already does. Supporting the DataFrame/DataSet API in Calcite would therefore essentially mean chaining two optimizers together, when they are supposed to be two optimizers you can choose between. Also, as Alessandro mentioned, gluing two planners together might give undesired outcomes in practice. Hence, I am having a hard time putting together a valid proposal for this GSoC project because of this issue. I would greatly appreciate it if anyone could offer some clarification regarding this JIRA ticket.
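To make the concern concrete, here is a toy sketch (plain Python, not actual Calcite or Catalyst code; the rule names and plan representation are invented for illustration) of how two independently-written rule-based planners, applied in sequence, can simply undo each other's rewrites when their rules disagree:

```python
# Toy illustration: plans are nested tuples like ("Filter", ("Project", ("Scan",))).
# Planner A and planner B each apply one rewrite rule; B's rule is the
# inverse of A's, so running B after A reverts A's "optimization".

def push_filter_below_project(plan):
    """Planner A's rule: Filter(Project(x)) -> Project(Filter(x))."""
    if plan[0] == "Filter" and plan[1][0] == "Project":
        return ("Project", ("Filter", plan[1][1]))
    return plan

def pull_filter_above_project(plan):
    """Planner B's rule: Project(Filter(x)) -> Filter(Project(x))."""
    if plan[0] == "Project" and plan[1][0] == "Filter":
        return ("Filter", ("Project", plan[1][1]))
    return plan

original = ("Filter", ("Project", ("Scan",)))
after_a = push_filter_below_project(original)   # A pushes the filter down
after_b = pull_filter_above_project(after_a)    # B pulls it right back up

assert after_a == ("Project", ("Filter", ("Scan",)))
assert after_b == original  # B reverted A's rewrite entirely
```

This is of course a caricature of real cost-based planners, but it captures why handing an already-optimized plan to a second optimizer that cannot be turned off may not preserve the first optimizer's choices.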
Again, I am deeply grateful for your replies and your helpfulness.

Best,
Linan Zheng

On Sat, Mar 17, 2018 at 3:40 AM, Alessandro Solimando <[email protected]> wrote:

> In my experience, if the "native" optimizer cannot be turned off, it can
> "revert back" some optimizations when you submit your "optimized"
> program/SQL query to the engine.
>
> As far as Spark 2.X is concerned, I am not aware of any way to turn Catalyst
> off, so if you have a different cost model and/or query planner you might
> easily end up with a different logical and/or physical plan than what you
> expect.
>
> In the "Calcite performance benchmark" discussion, started by Edmon Begoli,
> this fact is addressed, as he proposed to evaluate Calcite with/without the
> "native" optimizer, which makes a lot of sense to me and can lead to
> surprising results.
>
> My knowledge of Catalyst internals is unfortunately pretty shallow, so I
> cannot tell to what extent this can be an issue, or whether potential
> problems can be bypassed by using hints or similar techniques.
>
> If anyone knows more or has practical examples on the subject, I would be
> very interested in hearing more.
>
> Best regards,
> Alessandro
>
> On 16 March 2018 at 22:35, Julian Hyde <[email protected]> wrote:
>
> > The purpose of Calcite's Spark Adapter is to circumvent Spark SQL and
> > Catalyst entirely. Calcite parses the SQL, optimizes it to create a
> > physical plan that uses Spark relational operators, then converts that
> > plan to a Spark program.
> >
> > If you want to use Spark SQL and Catalyst that's totally fine, but don't
> > use Calcite for those cases.
> >
> > Julian
> >
> > > On Mar 16, 2018, at 11:44 AM, Linan Zheng <[email protected]> wrote:
> > >
> > > Hi Everyone,
> > >
> > > My name is Linan Zheng and I am currently a senior CS student at Boston
> > > University. I am fascinated by the idea of adding Apache Spark's
> > > DataFrame/DataSet API support to Apache Calcite.
> > > Right now I am working on the proposal, which I hope I can get some
> > > advice on. My question is that, since Spark has implemented the
> > > Catalyst query optimizer in Spark SQL, how should I approach Catalyst's
> > > planning rules (logical and physical)? And which component should be in
> > > charge of query optimization? Any advice and corrections will be much
> > > appreciated, and thank you for reading this email.
> > >
> > > --
> > > Best Regards,
> > > Linan Zheng

--
Best Regards,
Linan Zheng
