gene-bordegaray commented on code in PR #127:
URL: https://github.com/apache/datafusion-site/pull/127#discussion_r2606332693
##########
content/blog/2025-12-07-avoid-consecutive-repartitions.md:
##########
@@ -0,0 +1,428 @@
+---
+layout: post
+title: A Noob's Guide to Databases
+date: 2025-12-07
+author: Gene Bordegaray, Nga Tran, Andrew Lamb
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<div style="display: flex; align-items: center; gap: 20px; margin-bottom:
20px;">
+<div style="flex: 1;">
+
+Databases are some of the most complex yet interesting pieces of software.
They are amazing pieces of abstraction: query engines optimize and execute
complex plans, storage engines provide sophisticated infrastructure as the
backbone of the system, while intricate file formats lay the groundwork for
particular workloads. All of this is exposed by a user-friendly interface and
query languages (typically a dialect of SQL).
+<br><br>
+Starting a journey learning about database internals can be daunting. With so
many topics that are whole PhD degrees themselves, finding a place to start is
difficult. In this blog post, I will share my early journey in the database
world and a quick lesson on one of the first topics I dove into. If you are new
to the space, this post will help you get your foot in the door, and if you are
already a veteran, you may still learn something new.
+
+</div>
+<div style="flex: 0 0 40%; text-align: center;">
+<img
+ src="/blog/images/avoid-consecutive-repartitions/database_system_diagram.png"
+ width="100%"
+ class="img-responsive"
+ alt="Database System Components"
+/>
+</div>
+</div>
+
+---
+
+## **Who Am I?**
+
+I am Gene Bordegaray ([LinkedIn](https://www.linkedin.com/in/genebordegaray),
[GitHub](https://github.com/gene-bordegaray)), a recent computer science
graduate from UCLA and software engineer at Datadog. Before starting my job, I
had no real exposure to databases, only enough SQL knowledge to send CRUD
requests and choose between a relational and a NoSQL model in a system design
interview.
+
+When I found out I would be on a team focusing on query engines and execution,
I was excited but horrified. "Query engines?" From my experience, I typed SQL
queries into pgAdmin and got responses without knowing the dark magic that
happened under the hood.
+
+With what seemed like an impossible task at hand, I began my favorite few
months of learning.
+
+---
+
+## **Starting Out**
+
+I am no expert in databases or any of their subsystems, but I am someone who
recently began learning about them. These are some tips I find useful when
first starting.
+
+### Build a Foundation
+
+The first thing I did, which I highly recommend, was watch Andy Pavlo's [Intro
To Database Systems course](https://15445.courses.cs.cmu.edu/fall2025/). This
laid a great foundation for understanding how a database works from end to end
at a high level. It touches on topics ranging from file formats to query
optimization, and it was helpful to have a general context for the whole system
before diving deep into a single sector.
+
+### Narrow Your Scope
+
+The next crucial step is to pick your niche and stick to it. Database systems
are so vast that trying to tackle the whole beast at once is a lost cause. If
you want to effectively contribute to this space, you need to deeply understand
the system you are working on, and you will have much better luck narrowing
your scope.
+
+When learning about the entire database stack at a high level, note what parts
stick out as particularly interesting. For me, this focus is on query engines,
more specifically, the physical planner and optimizer.
+
+### A "Slow" Start
+
+The final piece of advice when starting, and I sound like a broken record, is
to take your time to learn. This is not an easy sector of software to jump
into; it will pay dividends to slow down and fully understand both the system
and why it is designed the way it is.
+
+When making your first contributions to an open-source project, start very
small but go as deep as you can. Don't leave any stone unturned. I did this by
looking for simpler issues, such as formatting or simple bug fixes, and
stepping through the entire data flow that relates to the issue, noting what
each component is responsible for.
+
+This will give you familiarity with the codebase and with using your tools,
such as your debugger, within the project.
+
+<div class="text-center">
+<img
+ src="/blog/images/avoid-consecutive-repartitions/noot_noot_database_meme.png"
+ width="50%"
+ class="img-responsive"
+ alt="Noot Noot Database Meme"
+/>
+</div>
+<br>
+
+Now that we have some general knowledge of database internals, a niche or
subsystem we want to dive deeper into, and the mindset for acquiring knowledge
before contributing, let's start with our first core issue.
+
+---
+
+## **Intro to DataFusion**
+
+As mentioned, the database subsystem I decided to explore was query engines.
The query engine is responsible for interpreting, optimizing, and executing
queries, aiming to do so as efficiently as possible.
+
+My team was in the full swing of restructuring how query execution would work
in our organization. The team decided we would use [Apache
DataFusion](https://datafusion.apache.org/) at the heart of our system, chosen
for its blazing-fast execution of analytical workloads and its broad
extensibility. DataFusion is written in Rust and builds on top of [Apache
Arrow](https://arrow.apache.org/) (another great project), a columnar memory
format that enables it to efficiently process large volumes of data in memory.
+
+This project offered a perfect environment for my first steps into databases:
clear, production-ready Rust programming, a manageable codebase, high
performance for a specific use case, and a welcoming community.
+
+### Parallel Execution in DataFusion
+
+<div style="display: flex; align-items: top; gap: 20px; margin-bottom: 20px;">
+<div style="flex: 1;">
+
+Before discussing this issue, it is essential to understand how DataFusion
handles parallel execution.
+<br><br>
+DataFusion implements a vectorized <a
href="https://dl.acm.org/doi/10.1145/93605.98720">Volcano Model</a>, similar to
other state-of-the-art engines such as ClickHouse. The Volcano Model is built
on the idea that each operation is abstracted into an operator, and a DAG of
operators can represent an entire query. Each operator implements a next()
function that returns a batch of tuples, or a NULL marker once no more data is
available.
+<br><br>
+DataFusion achieves multi-core parallelism through the use of "exchange
operators." Individual operators are implemented to use a single CPU core, and
the RepartitionExec operator is responsible for distributing work across
multiple cores.
+
+</div>
+<div style="flex: 0 0 40%; text-align: center;">
+<img
+ src="/blog/images/avoid-consecutive-repartitions/volcano_model_diagram.png"
+ width="100%"
+ class="img-responsive"
+ alt="Vectorized Volcano Model Example"
+/>
+</div>
+</div>
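
To make the pull-based flow concrete, here is a minimal Python sketch of two operators that each expose a `next()` method. It is only an illustration of the idea: the `Scan`, `Filter`, and `collect` names are made up for this example and are not DataFusion's API, and a "batch" is simplified to a plain list of integers.

```python
from typing import Callable, List, Optional

Batch = List[int]  # a "batch of tuples", simplified to a list of integers


class Scan:
    """Hypothetical leaf operator: hands out its input in fixed-size batches."""

    def __init__(self, rows: List[int], batch_size: int) -> None:
        self.rows = rows
        self.batch_size = batch_size
        self.offset = 0

    def next(self) -> Optional[Batch]:
        if self.offset >= len(self.rows):
            return None  # end of stream: no more data
        batch = self.rows[self.offset:self.offset + self.batch_size]
        self.offset += self.batch_size
        return batch


class Filter:
    """Hypothetical operator: pulls a batch from its child and filters it."""

    def __init__(self, child, predicate: Callable[[int], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def next(self) -> Optional[Batch]:
        batch = self.child.next()
        if batch is None:
            return None  # propagate end of stream upward
        return [row for row in batch if self.predicate(row)]


def collect(root) -> List[int]:
    """Drive the plan by calling next() on the root until the stream ends."""
    out: List[int] = []
    while (batch := root.next()) is not None:
        out.extend(batch)
    return out


# A two-operator plan: Scan -> Filter (keep even numbers).
plan = Filter(Scan(list(range(10)), batch_size=4), lambda x: x % 2 == 0)
result = collect(plan)
print(result)  # [0, 2, 4, 6, 8]
```

Note that only the root is ever called directly. Each operator pulls from its child on demand, which is what lets a whole query be expressed as a graph of small, composable pieces.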
+
+### What is Repartitioning?
+
+Partitioning is a "divide-and-conquer" approach to executing a query. Each
partition is a subset of the data that is being processed on a single core.
Repartitioning is an operation that redistributes data across different
partitions to balance workloads, reduce data skew, and increase parallelism.
Two repartitioning methods are used in DataFusion: round-robin and hash.
+
+#### **Round-Robin Repartitioning**
+
+<div style="display: flex; align-items: top; gap: 20px; margin-bottom: 20px;">
+<div style="flex: 1;">
+
+Round-robin repartitioning is the simplest partitioning strategy. Incoming
data is processed in batches (chunks of rows), and these batches are
distributed across partitions cyclically, with each new batch assigned to the
next partition in the rotation.
+<br><br>
+Round-robin repartitioning is useful when the data grouping isn't known or
when aiming for an even distribution across partitions. Because it simply
assigns batches in order without inspecting their contents, it is a
low-overhead way to increase parallelism for downstream operations.
+
+</div>
+<div style="flex: 0 0 25%; text-align: center;">
+<img
+ src="/blog/images/avoid-consecutive-repartitions/round_robin_repartitioning.png"
+ width="100%"
+ class="img-responsive"
+ alt="Round-Robin Repartitioning"
+/>
+</div>
+</div>
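
As a rough sketch of the idea, assuming batches are plain Python lists, round-robin assignment comes down to a modulo on the batch index. The function name `round_robin_repartition` is hypothetical and this is not how DataFusion implements it internally.

```python
from typing import List

Batch = List[int]  # a batch is simplified to a list of row values


def round_robin_repartition(
    batches: List[Batch], num_partitions: int
) -> List[List[Batch]]:
    """Assign batch i to partition i % num_partitions without reading its rows."""
    partitions: List[List[Batch]] = [[] for _ in range(num_partitions)]
    for i, batch in enumerate(batches):
        partitions[i % num_partitions].append(batch)
    return partitions


batches = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
partitions = round_robin_repartition(batches, num_partitions=3)
for index, partition in enumerate(partitions):
    print(index, partition)
# 0 [[1, 2], [7, 8]]
# 1 [[3, 4], [9, 10]]
# 2 [[5, 6]]
```

The function never looks inside a batch, which is why the overhead stays low. The trade-off is that rows with the same value can end up in different partitions.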
+
+#### **Hash Repartitioning**
+
+<div style="display: flex; align-items: top; gap: 20px; margin-bottom: 20px;">
+<div style="flex: 1;">
+
+Hash repartitioning distributes data based on a hash function applied to one
or more columns, called the partitioning key. Rows with the same key value
always hash to the same value, so they are placed in the same partition.
+<br><br>
+Hash repartitioning is useful when working with grouped data. Imagine you have
a database containing information on company sales, and you are looking to find
the total revenue each store produced. Hash repartitioning would make this
query much more efficient. Rather than iterating over the data on a single
thread and keeping a running sum for each store, it would be better to hash
repartition on the store column so that multiple threads can each total the
sales for their own subset of stores.
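
The store revenue example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names and the sample `sales` rows are made up, `zlib.crc32` stands in for whatever hash function a real engine uses, and the per-partition work runs sequentially here rather than on separate threads.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (store, sale amount)


def partition_for(store: str, num_partitions: int) -> int:
    """Map a partitioning key to a partition index.

    crc32 is used because it is deterministic across runs, unlike Python's
    built-in hash() for strings.
    """
    return zlib.crc32(store.encode("utf-8")) % num_partitions


def hash_repartition(rows: List[Row], num_partitions: int) -> List[List[Row]]:
    """Send each row to the partition chosen by hashing its store column."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(row[0], num_partitions)].append(row)
    return partitions


def revenue_per_store(partition: List[Row]) -> Dict[str, float]:
    """The work one thread would do: total the sales within its partition."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in partition:
        totals[store] += amount
    return dict(totals)


sales = [("north", 10.0), ("south", 5.0), ("north", 2.5),
         ("east", 7.0), ("south", 1.0)]
partitions = hash_repartition(sales, num_partitions=2)
partial_totals = [revenue_per_store(p) for p in partitions]

# Every store lives in exactly one partition, so the partial results never
# overlap and can simply be combined.
merged = {store: total
          for totals in partial_totals
          for store, total in totals.items()}
```

Because all rows for a given store land in the same partition, each partition's totals are already final and no cross-partition merging of sums is needed.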
Review Comment:
Thank you 👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]