Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
alamb commented on PR #74: URL: https://github.com/apache/datafusion-site/pull/74#issuecomment-2975930224 Blogs are posted: - https://datafusion.staged.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one/ - https://datafusion.staged.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
alamb merged PR #74: URL: https://github.com/apache/datafusion-site/pull/74 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
alamb commented on PR #74: URL: https://github.com/apache/datafusion-site/pull/74#issuecomment-2975871112 Thanks again @timsaucer @akurmustafa and @kevinjqliu -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
alamb commented on code in PR #74: URL: https://github.com/apache/datafusion-site/pull/74#discussion_r2145976359 ## content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md: ## @@ -0,0 +1,250 @@ +--- +layout: post +title: Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query Optimization Overview +date: 2025-06-15 +author: alamb, akurmustafa +categories: [core] +--- + + + + + +*Note: this blog was originally published [on the InfluxData blog](https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/)* + + +## Introduction + +Sometimes Query Optimizers are seen as a sort of black magic, [“the most +challenging problem in computer +science,”](https://15799.courses.cs.cmu.edu/spring2025/) according to Father +Pavlo, or some behind-the-scenes player. We believe this perception is because: + + +1. One must implement the rest of a database system (data storage, transactions, + SQL parser, expression evaluation, plan execution, etc.) **before** the + optimizer becomes critical[5](#footnote5). + +2. Some parts of the optimizer are tightly tied to the rest of the system (e.g., + storage or indexes), so many classic optimizers are described with + system-specific terminology. + +3. Some optimizer tasks, such as access path selection and join order are known + challenges and not yet solved (practically)—maybe they really do require + black magic 🤔. + +However, Query Optimizers are no more complicated in theory or practice than other parts of a database system, as we will argue in a series of posts: + +**Part 1: (this post)**: + +* Review what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames. +* Describe how industrial Query Optimizers are structured and standard optimization classes. + +**Part 2:** + +* Describe the optimization categories with examples and pointers to implementations. +* Describe [Apache DataFusion](https://datafusion.apache.org/)’s rationale and approach to query optimization, specifically for access path and join ordering. + +After reading these blogs, we hope people will use DataFusion to: + +1. Build their own system specific optimizers. +2. Perform practical academic research on optimization (especially researchers + working on new optimizations / join ordering—looking at you [CMU + 15-799](https://15799.courses.cs.cmu.edu/spring2025/), next year). + + +## Query Optimizer Background + +The key pitch for querying databases, and likely the key to the longevity of SQL +(despite people’s love/hate relationship—see [SQL or Death? Seminar Series – +Spring 2025](https://db.cs.cmu.edu/seminar2025/)), is that it disconnects the +`WHAT` you want to compute from the `HOW` to do it. SQL is a *declarative* +language—it describes what answers are desired rather than an *imperative* +language such as Python, where you describe how to do the computation as shown +in Figure 1. + + + +**Figure 1**: Query Execution: Users describe the answer they want using either +SQL or a DataFrame. For SQL, a Query Planner translates the parsed query +into an *initial plan*. The DataFrame API creates an initial plan directly. +The initial plan is correct, but slow. Then, the Query +Optimizer rewrites the initial plan into an *optimized plan*, which computes +the same results but faster and more efficiently. Finally, the Execution Engine +executes the optimized plan producing results. + +## SQL, DataFrames, LogicalPlan Equivalence + +Given their name, it is not surprising that Query Optimizers can improve the +performance of SQL queries. However, it is under-appreciated that this also +applies to DataFrame style APIs. + +Classic DataFrame systems such as [pandas] and [Polars] (by default) execute +eagerly and thus have limited opportunities for optimization. However, more +modern APIs such as [Polar's lazy API], [Apache Spark's DataFrame]. and +[DataFusion's DataFrame] are much faster as they use the design shown in Figure +1 and apply many query optimization techniques. + +[pandas]: https://pandas.pydata.org/ +[Polars]: https://pola.rs/) +[Polar'’'s lazy API]: https://docs.pola.rs/user-guide/lazy/using/ Review Comment: It is an interesting question if `polars` is pllural (`polar's`) or singular (`polars's`). I think the way you are suggesting makes sense -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
kevinjqliu commented on PR #74: URL: https://github.com/apache/datafusion-site/pull/74#issuecomment-2967429852 also nit, any links with `https://github.com/apache/datafusion/blob/main/` runs into the risk of being stale at a later time. For example, if a file path was moved to a different location. It would be better to use a specific git hash like ``` [unwrap_cast](https://github.com/apache/datafusion/blob/8f3f70877febaa79be3349875e979d3a6e65c30e/datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs#L70), ``` https://github.com/apache/datafusion-site/pull/74/files#diff-75edab122b58ff4a9a79c03ffa14767046dc8f8df5b511e4e3142cf5e01774edR286 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
kevinjqliu commented on code in PR #74: URL: https://github.com/apache/datafusion-site/pull/74#discussion_r2143137504 ## content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md: ## @@ -0,0 +1,533 @@ +--- +layout: post +title: Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion +date: 2025-06-15 +author: alamb, akurmustafa +categories: [core] +--- + + + +*Note, this blog was originally published [on the InfluxData blog].* + +[on the InfluxData blog]: https://www.influxdata.com/blog/optimizing-sql-dataframes-part-two/ + +In the [first part of this post], we discussed what a Query Optimizer is, what +role it plays, and described how industrial optimizers are organized. In this +second post, we describe various optimizations that are found in [Apache +DataFusion](https://datafusion.apache.org/) and other industrial systems in more +detail. + + +DataFusion contains high quality, full-featured implementations for *Always +Optimizations* and *Engine Specific Optimizations* (defined in Part 1). +Optimizers are implemented as rewrites of `LogicalPlan` in the [logical +optimizer](https://github.com/apache/datafusion/tree/main/datafusion/optimizer) +or rewrites of `ExecutionPlan` in the [physical +optimizer](https://github.com/apache/datafusion/tree/main/datafusion/physical-optimizer). +This design means the same optimizer passes are applied for SQL queries, +DataFrame queries, as well as plans for other query language frontends such as +[InfluxQL](https://github.com/influxdata/influxdb3_core/tree/26a30bf8d6e2b6b3f1dd905c4ec27e3db6e20d5f/iox_query_influxql) +in InfluxDB 3.0, +[PromQL](https://github.com/GreptimeTeam/greptimedb/blob/0bd322a078cae4f128b791475ec91149499de33a/src/query/src/promql/planner.rs#L1) +in [Greptime](https://greptime.com/), and +[vega](https://github.com/vega/vegafusion/tree/dc15c1b9fc7d297f12bea919795d58cda1c88fcf/vegafusion-core/src/planning) +in [VegaFusion](https://vegafusion.io/). + + +[first part of this post]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one + +## Always Optimizations + +Some optimizations are so important they are found in almost all query engines +and are typically the first implemented as they provide the largest cost / +benefit ratio (and performance is terrible without them). + + +### Predicate/Filter Pushdown + +**Why**: Avoid carrying unneeded *rows *as soon as possible + +**What**: Moves filters “down” in the plan so they run earlier during execution, as shown in Figure 1. + +**Example Implementations**: [DataFusion], [DuckDB], [ClickHouse] + +[DataFusion]: https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/push_down_filter.rs +[DuckDB]: https://github.com/duckdb/duckdb/blob/main/src/optimizer/filter_pushdown.cpp +[ClickHouse]: https://github.com/ClickHouse/ClickHouse/blob/master/src/Processors/QueryPlan/Optimizations/filterPushDown.cpp + +The earlier data is filtered out in the plan, the less work the rest of the plan +has to do. Most mature databases aggressively use filter pushdown / early +filtering combined with techniques such as partition and storage pruning (e.g. +[Parquet Row Group pruning]) for performance. + +[Parquet Row Group pruning]: https://blog.xiangpeng.systems/posts/parquet-to-arrow/ + +An extreme, and somewhat contrived, is the query + +```sql +SELECT city, COUNT(*) FROM population GROUP BY city HAVING city = 'BOSTON'; +``` + +Semantically, `HAVING` is [evaluated after] `GROUP BY` in SQL. However, computing +the population of all cities and discarding everything except Boston is much +slower than only computing the population for Boston and so most Query +Optimizers will evaluate the filter before the aggregation. + +[evaluated after]: https://www.datacamp.com/tutorial/sql-order-of-execution + + + +**Figure 1**: Filter Pushdown. In (**A**) without filter pushdown, the operator +processes more rows, reducing efficiency. In (**B**) with filter pushdown, the +operator receives fewer rows, resulting in less overall work and leading to a +faster and more efficient query. + + +### Projection Pushdown + +**Why**: Avoid carrying unneeded *columns *as soon as possible + +**What: **Pushes “projection” (keeping only certain columns) earlier in the plan, as shown in Figure 2. + +**Example Implementations: **Implementations: [DataFusion](https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/projection_pushdown.rs), [DuckDB](https://github.com/duckdb/duckdb/blob/a8a6a080c8809d5d4b3c955e9f113574f6f0bfe0/src/optimizer/pushdown/pushdown_projection.cpp), [ClickHouse](https://github.com/ClickHouse/ClickHouse/blob/master/src/Processors/QueryPlan/Optimizations/optimizeUseNormalProjection.cpp) + +Similarly to the motivation for *Filter Pushdown*, the earlier the plan stops +doing something, the less work it does overall and thus the faster it runs. For +Projection Pushdown, if columns are not needed later in
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
kevinjqliu commented on code in PR #74: URL: https://github.com/apache/datafusion-site/pull/74#discussion_r2143107153 ## content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md: ## @@ -0,0 +1,250 @@ +--- +layout: post +title: Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query Optimization Overview +date: 2025-06-15 +author: alamb, akurmustafa +categories: [core] +--- + + + + + +*Note: this blog was originally published [on the InfluxData blog](https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/)* + + +## Introduction + +Sometimes Query Optimizers are seen as a sort of black magic, [“the most +challenging problem in computer +science,”](https://15799.courses.cs.cmu.edu/spring2025/) according to Father +Pavlo, or some behind-the-scenes player. We believe this perception is because: + + +1. One must implement the rest of a database system (data storage, transactions, + SQL parser, expression evaluation, plan execution, etc.) **before** the + optimizer becomes critical[5](#footnote5). + +2. Some parts of the optimizer are tightly tied to the rest of the system (e.g., + storage or indexes), so many classic optimizers are described with + system-specific terminology. + +3. Some optimizer tasks, such as access path selection and join order are known + challenges and not yet solved (practically)—maybe they really do require + black magic 🤔. + +However, Query Optimizers are no more complicated in theory or practice than other parts of a database system, as we will argue in a series of posts: + +**Part 1: (this post)**: + +* Review what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames. +* Describe how industrial Query Optimizers are structured and standard optimization classes. + +**Part 2:** + +* Describe the optimization categories with examples and pointers to implementations. +* Describe [Apache DataFusion](https://datafusion.apache.org/)’s rationale and approach to query optimization, specifically for access path and join ordering. + +After reading these blogs, we hope people will use DataFusion to: + +1. Build their own system specific optimizers. +2. Perform practical academic research on optimization (especially researchers + working on new optimizations / join ordering—looking at you [CMU + 15-799](https://15799.courses.cs.cmu.edu/spring2025/), next year). + + +## Query Optimizer Background + +The key pitch for querying databases, and likely the key to the longevity of SQL +(despite people’s love/hate relationship—see [SQL or Death? Seminar Series – +Spring 2025](https://db.cs.cmu.edu/seminar2025/)), is that it disconnects the +`WHAT` you want to compute from the `HOW` to do it. SQL is a *declarative* +language—it describes what answers are desired rather than an *imperative* +language such as Python, where you describe how to do the computation as shown +in Figure 1. + + + +**Figure 1**: Query Execution: Users describe the answer they want using either +SQL or a DataFrame. For SQL, a Query Planner translates the parsed query +into an *initial plan*. The DataFrame API creates an initial plan directly. +The initial plan is correct, but slow. Then, the Query +Optimizer rewrites the initial plan into an *optimized plan*, which computes +the same results but faster and more efficiently. Finally, the Execution Engine +executes the optimized plan producing results. + +## SQL, DataFrames, LogicalPlan Equivalence + +Given their name, it is not surprising that Query Optimizers can improve the +performance of SQL queries. However, it is under-appreciated that this also +applies to DataFrame style APIs. + +Classic DataFrame systems such as [pandas] and [Polars] (by default) execute +eagerly and thus have limited opportunities for optimization. However, more +modern APIs such as [Polar's lazy API], [Apache Spark's DataFrame]. and +[DataFusion's DataFrame] are much faster as they use the design shown in Figure +1 and apply many query optimization techniques. + +[pandas]: https://pandas.pydata.org/ +[Polars]: https://pola.rs/) Review Comment: ```suggestion [Polars]: https://pola.rs/ ``` ## content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md: ## @@ -0,0 +1,250 @@ +--- +layout: post +title: Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query Optimization Overview +date: 2025-06-15 +author: alamb, akurmustafa +categories: [core] +--- + + + + + +*Note: this blog was originally published [on the InfluxData blog](https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/)* + + +## Introduction + +Sometimes Query Optimizers are seen as a sort of black magic, [“the most +challenging problem in computer +science,”](https://15799.courses.cs.cmu.edu/spring2025/) according to Father +Pavlo, or some behind-the-scenes player. We believe this perception is because: + + +1. One must implement the rest of a database system (data storage, transactions, + SQL parser, e
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
alamb commented on PR #74: URL: https://github.com/apache/datafusion-site/pull/74#issuecomment-2963320742 Thank you @akurmustafa and @timsaucer I hope to publish these posts in a few days if there are no more comments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
alamb commented on code in PR #74: URL: https://github.com/apache/datafusion-site/pull/74#discussion_r2140516503 ## content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md: ## @@ -0,0 +1,249 @@ +--- +layout: post +title: Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query Optimization Overview +date: 2025-06-15 +author: alamb, akurmustafa +categories: [core] +--- + + + + + +*Note: this blog was originally published [on the InfluxData blog](https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/)* + + +## Introduction + +Sometimes Query Optimizers are seen as a sort of black magic, [“the most +challenging problem in computer +science,”](https://15799.courses.cs.cmu.edu/spring2025/) according to Father +Pavlo, or some behind-the-scenes player. We believe this perception is because: + + +1. One must implement the rest of a database system (data storage, transactions, + SQL parser, expression evaluation, plan execution, etc.) **before** the + optimizer becomes critical[^5]. + +2. Some parts of the optimizer are tightly tied to the rest of the system (e.g., + storage or indexes), so many classic optimizers are described with + system-specific terminology. + +3. Some optimizer tasks, such as access path selection and join order are known + challenges and not yet solved (practically)—maybe they really do require + black magic 🤔. + +However, Query Optimizers are no more complicated in theory or practice than other parts of a database system, as we will argue in a series of posts: + +**Part 1: (this post)**: + +* Review what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames. +* Describe how industrial Query Optimizers are structured and standard optimization classes. + +**Part 2:** + +* Describe the optimization categories with examples and pointers to implementations. +* Describe [Apache DataFusion](https://datafusion.apache.org/)’s rationale and approach to query optimization, specifically for access path and join ordering. + +After reading these blogs, we hope people will use DataFusion to: + +1. Build their own system specific optimizers. +2. Perform practical academic research on optimization (especially researchers + working on new optimizations / join ordering—looking at you [CMU + 15-799](https://15799.courses.cs.cmu.edu/spring2025/), next year). + + +## Query Optimizer Background + +The key pitch for querying databases, and likely the key to the longevity of SQL +(despite people’s love/hate relationship—see [SQL or Death? Seminar Series – +Spring 2025](https://db.cs.cmu.edu/seminar2025/)), is that it disconnects the +`WHAT` you want to compute from the `HOW` to do it. SQL is a *declarative* +language—it describes what answers are desired rather than an *imperative* +language such as Python, where you describe how to do the computation as shown +in Figure 1. + + Review Comment: I think the difference / confusion is that I was using the term "Query Planner" to mean "translate from a SQL parse tree to initial LogicalPlan" -- and from that perspective I think of the DataFrame API as directly building a LogicalPlan (and thus does not need a "planner") -- however I can see how the use of word "planner" is confusing. I pushed a commit to try and clarify it. in ee8b460 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
alamb commented on PR #74: URL: https://github.com/apache/datafusion-site/pull/74#issuecomment-2963244337 Thank you @akurmustafa -- fixing the links is super helpful -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
akurmustafa commented on PR #74: URL: https://github.com/apache/datafusion-site/pull/74#issuecomment-2957563464 Thanks @timsaucer for the reviews and pointing out the broken links. @alamb I resolved the link issues by resorting html instead of relying on the markdown rendering. I sent the commit directly in this branch. I hope your local repo doesn't have further commits. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]
timsaucer commented on code in PR #74:
URL: https://github.com/apache/datafusion-site/pull/74#discussion_r2135540165
##
content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md:
##
@@ -0,0 +1,249 @@
+---
+layout: post
+title: Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query
Optimization Overview
+date: 2025-06-15
+author: alamb, akurmustafa
+categories: [core]
+---
+
+
+
+
+
+*Note: this blog was originally published [on the InfluxData
blog](https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/)*
+
+
+## Introduction
+
+Sometimes Query Optimizers are seen as a sort of black magic, [“the most
+challenging problem in computer
+science,”](https://15799.courses.cs.cmu.edu/spring2025/) according to Father
+Pavlo, or some behind-the-scenes player. We believe this perception is because:
+
+
+1. One must implement the rest of a database system (data storage,
transactions,
+ SQL parser, expression evaluation, plan execution, etc.) **before** the
+ optimizer becomes critical[^5].
+
+2. Some parts of the optimizer are tightly tied to the rest of the system
(e.g.,
+ storage or indexes), so many classic optimizers are described with
+ system-specific terminology.
+
+3. Some optimizer tasks, such as access path selection and join order are known
+ challenges and not yet solved (practically)—maybe they really do require
+ black magic 🤔.
+
+However, Query Optimizers are no more complicated in theory or practice than
other parts of a database system, as we will argue in a series of posts:
+
+**Part 1: (this post)**:
+
+* Review what a Query Optimizer is, what it does, and why you need one for SQL
and DataFrames.
+* Describe how industrial Query Optimizers are structured and standard
optimization classes.
+
+**Part 2:**
+
+* Describe the optimization categories with examples and pointers to
implementations.
+* Describe [Apache DataFusion](https://datafusion.apache.org/)’s rationale and
approach to query optimization, specifically for access path and join ordering.
+
+After reading these blogs, we hope people will use DataFusion to:
+
+1. Build their own system specific optimizers.
+2. Perform practical academic research on optimization (especially researchers
+ working on new optimizations / join ordering—looking at you [CMU
+ 15-799](https://15799.courses.cs.cmu.edu/spring2025/), next year).
+
+
+## Query Optimizer Background
+
+The key pitch for querying databases, and likely the key to the longevity of
SQL
+(despite people’s love/hate relationship—see [SQL or Death? Seminar Series –
+Spring 2025](https://db.cs.cmu.edu/seminar2025/)), is that it disconnects the
+`WHAT` you want to compute from the `HOW` to do it. SQL is a *declarative*
+language—it describes what answers are desired rather than an *imperative*
+language such as Python, where you describe how to do the computation as shown
+in Figure 1.
+
+
+
+**Figure 1**: Query Execution: Users describe the answer they want using either
+a DataFrame or SQL. The query planner or DataFrame API translates that
+description into an *Initial Plan*, which is correct but slow. The Query
+Optimizer then rewrites the initial plan to an *Optimized Plan*, which computes
+the same results but faster and more efficiently. Finally, the Execution Engine
+executes the optimized plan producing results.
+
+## SQL, DataFrames, LogicalPlan Equivalence
+
+Given their name, it is not surprising that Query Optimizers can improve the
+performance of SQL queries. However, it is under-appreciated that this also
+applies to DataFrame style APIs.
+
+Classic DataFrame systems such as [pandas] and [Polars] (by default) execute
+eagerly and thus have limited opportunities for optimization. However, more
+modern APIs such as [Polar's lazy API], [Apache Spark's DataFrame]. and
+[DataFusion's DataFrame] are much faster as they use the design shown in Figure
+1 and apply many query optimization techniques.
+
+[pandas]: https://pandas.pydata.org/
+[Polars]: https://pola.rs/)
+[Polar'’'s lazy API]: https://docs.pola.rs/user-guide/lazy/using/
+[Apache Spark's DataFrame]:
https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes),
+[DataFusion's DataFrame]:
https://datafusion.apache.org/user-guide/dataframe.html
+
+## Example of Query Optimizer
+
+This section motivates the value of a Query Optimizer with an example. Let’s
say
+you have some observations of animal behavior, as illustrated in Table 1.
+
+
+
+**Table 1**: Example observational data.
+
+If the user wants to know the average population for some species in the last
+month, a user can write a SQL query or a DataFrame such as the following:
+
+SQL:
+
+```sql
+SELECT location, AVG(population)
+FROM observations
+WHERE species = ‘contrarian spider’ AND
+ observation_time >= now() - interval '1 month'
+GROUP BY location
+```
+
+DataFrame:
+
+```rust
+df.scan("observations")
+ .filter(col("species").eq("contrarian spider"))
+ .filter(col("o
