Omega359 commented on code in PR #58:
URL: https://github.com/apache/datafusion-site/pull/58#discussion_r1987631033
##########
content/blog/2025-03-05-ordering-analysis.md:
##########
@@ -0,0 +1,353 @@
+---
+layout: post
+title: Analysis of Ordering for Better Plans
+date: 2025-03-05
+author: Mustafa Akur, Andrew Lamb
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/11631 for details -->
+
+## Introduction
+In this blog post, we will explore how to determine whether an ordering
requirement of an operator is satisfied by its input data. This analysis is
essential for order-based optimizations and is often more complex than one
might initially think.
+<blockquote style="border-left: 4px solid #007bff; padding: 10px;
background-color: #f8f9fa;">
+ <strong>Ordering Requirement</strong> for an operator refers to the
condition that input data must be sorted in a certain way for the operator to
function as intended. If this condition is not met, the operator may not
perform as expected. It is the job of the planner to make sure that the
requirements - such as specific ordering, specific distribution, etc. - of all
operators are satisfied during execution.
+</blockquote>
+
+There are various use cases where this type of analysis can be useful.
+### Removing Unnecessary Sorts
+Imagine a user wants to execute the following query:
+```SQL
+SELECT hostname, log_line
+FROM telemetry ORDER BY time ASC LIMIT 10;
+```
+If we don't know anything about the `telemetry` table, we need to sort it by
`time ASC` and then retrieve the first 10 rows to get the correct result.
However, if the table is already ordered by `time ASC`, simply retrieving the
first 10 rows is sufficient. This approach executes much faster and uses less
memory compared to the first version.
+
+The key is that the query optimizer needs to know the data is already sorted.
For simple queries this is easy to determine, but it gets complicated fast. For
example, what if your data is sorted by `[hostname, time ASC]` and your
query is:
+```sql
+SELECT hostname, log_line
+FROM telemetry WHERE hostname = 'app.example.com' ORDER BY time ASC;
+```
+In this case, the system still doesn't have to do any sorting -- but only if
its analysis can reason about the sortedness of the stream given that
`hostname` has a single value after the filter.
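The reasoning above can be sketched in a few lines of Python. This is only an illustration, not DataFusion's implementation: the function `ordering_satisfied` and its parameters are hypothetical names, and every column is assumed to be sorted ascending. The idea is that a column known to hold a single value cannot affect the order, so it can be dropped before comparing the required ordering against the existing one.

```python
from typing import FrozenSet, Sequence


def ordering_satisfied(
    existing: Sequence[str],
    required: Sequence[str],
    constants: FrozenSet[str] = frozenset(),
) -> bool:
    """Return True if data sorted by `existing` is also sorted by `required`.

    Hypothetical helper: all columns are assumed to be sorted ascending, and
    `constants` names the columns known to hold a single value (for example,
    because of an equality filter).
    """
    # A constant column cannot change the relative order of rows,
    # so it is ignored on both sides.
    existing_cols = [c for c in existing if c not in constants]
    required_cols = [c for c in required if c not in constants]
    # The remaining requirement must be a prefix of the remaining ordering.
    return existing_cols[: len(required_cols)] == required_cols


# Data sorted by [hostname, time]; the query asks for ORDER BY time.
without_filter = ordering_satisfied(["hostname", "time"], ["time"])
with_filter = ordering_satisfied(
    ["hostname", "time"], ["time"], constants=frozenset({"hostname"})
)
print(without_filter, with_filter)
```

Without the filter the requirement is not met, because `time` is only ordered within each `hostname`. Once `hostname` is known to be constant, the same check succeeds and the sort can be skipped.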
+
+### Optimizing Execution Modes Using Ordering Information
+As another use case, some operators can utilize the ordering information to
change their underlying algorithm to execute more efficiently. Consider the
following query:
+```SQL
+SELECT COUNT(log_line)
+FROM telemetry GROUP BY hostname;
+```
+When `telemetry` is sorted by `hostname`, aggregation doesn't need to hash all
of its input data. It can use a much more efficient algorithm for
grouping the data according to the `hostname` values. Failure to detect the
ordering can result in choosing a sub-optimal algorithm variant for the
operator. To see this in practice, check out the
[source](https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/order)
for the ordered variant of the `Aggregation` in `DataFusion`.
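
To make the difference concrete, here is a minimal Python sketch of grouping over input that is already sorted by the group key. It is not the code behind the link above; the function name `count_sorted_groups` and the row shape `(hostname, log_line)` are assumptions made for illustration. Because equal keys are adjacent, a group is complete as soon as the key changes, so only one running counter is kept instead of a hash table over all groups.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, Optional[str]]  # (hostname, log_line); log_line may be NULL


def count_sorted_groups(rows: Iterable[Row]) -> Iterator[Tuple[str, int]]:
    """Yield (hostname, COUNT(log_line)) for rows sorted by hostname.

    Hypothetical sketch: the result is only correct if the input really is
    sorted (or at least grouped) by hostname.
    """
    current: Optional[str] = None
    count = 0
    started = False
    for hostname, log_line in rows:
        if started and hostname != current:
            # The key changed, so the previous group is finished.
            yield current, count
            count = 0
        current = hostname
        started = True
        # COUNT(column) skips NULL values.
        if log_line is not None:
            count += 1
    if started:
        yield current, count


rows = [("a.example.com", "x"), ("a.example.com", None), ("b.example.com", "y")]
result = list(count_sorted_groups(rows))
print(result)
```

A group's result can be emitted before the rest of the input is read, which is what lets an ordered aggregation stream its output and keep memory use small.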
Review Comment:
`input is sorted corectly` ... -> input is sorted correctly
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]