BlakeOrth commented on issue #17211:
URL: https://github.com/apache/datafusion/issues/17211#issuecomment-3272025207
Ok, I've done some data collection using the in progress instrumented object
store that illustrates some differences in request patterns. I'll share and go
over the entire output, including the setup so it can be reproduced. Note that
all of the following sections were run in a single `datafusion-cli` session,
I'm just breaking it up a bit to make it easier to understand and discuss. One
item to note is I'm running a `debug` build instead of `release-nonlto` because
I think we're mostly interested in the IO operations, not necessarily whether
or not they're running at optimized speeds.
## Analysis
### Table Setup
```sql
DataFusion CLI v50.0.0
> set datafusion.execution.listing_table_factory_infer_partitions=false;
0 row(s) fetched.
Elapsed 0.002 seconds.
> \object_store_profiling summary
ObjectStore Profile mode set to Summary
> CREATE EXTERNAL TABLE overture_flat
STORED AS PARQUET LOCATION
's3://overturemaps-us-west-2/release/2025-07-23.0/';
0 row(s) fetched.
Elapsed 34.223 seconds.
Object Store Profiling
List Summary:
count: 1
List Summary:
count: 1
Get Summary:
count: 1536
duration min: 0.045644s
duration max: 0.593886s
duration avg: 0.101914s
size min: 8 B
size max: 1328116 B
size avg: 329311 B
size sum: 505822728 B
> set datafusion.execution.listing_table_factory_infer_partitions=true;
0 row(s) fetched.
Elapsed 0.001 seconds.
Object Store Profiling
> CREATE EXTERNAL TABLE overture_partitioned
STORED AS PARQUET LOCATION
's3://overturemaps-us-west-2/release/2025-07-23.0/';
0 row(s) fetched.
Elapsed 35.500 seconds.
Object Store Profiling
List Summary:
count: 1
List Summary:
count: 3
Get Summary:
count: 1536
duration min: 0.042258s
duration max: 0.314563s
duration avg: 0.100344s
size min: 8 B
size max: 1328116 B
size avg: 329311 B
size sum: 505822728 B
```
This is the test case setup. In order to keep things as fair as possible, we
can set up two tables using the same underlying data. One of the tables is a
flat table with no partitions, the other lets the ListingTable operate on the
partitions defined by the paths. Initially I'm using the summary mode for
profiling due to the size of the underlying data, but even here we can already
see some discrepancies between the flat and partitioned table emerge. In spite
of the same number of `GET` requests, the partitioned table executes 2
additional `LIST` operations.
### First queries
```sql
> select count(*) from overture_flat;
+------------+
| count(*) |
+------------+
| 4198576839 |
+------------+
1 row(s) fetched.
Elapsed 36.787 seconds.
Object Store Profiling
List Summary:
count: 1
Get Summary:
count: 1536
duration min: 0.044221s
duration max: 0.390369s
duration avg: 0.102023s
size min: 8 B
size max: 1328116 B
size avg: 329311 B
size sum: 505822728 B
> select count(*) from overture_partitioned;
+------------+
| count(*) |
+------------+
| 4198576839 |
+------------+
1 row(s) fetched.
Elapsed 37.422 seconds.
Object Store Profiling
List Summary:
count: 22
duration min: 0.042043s
duration max: 0.148611s
duration avg: 0.096775s
Get Summary:
count: 1536
duration min: 0.045364s
duration max: 0.375742s
duration avg: 0.099633s
size min: 8 B
size max: 1328116 B
size avg: 329311 B
size sum: 505822728 B
```
These first queries really begin to show discrepancies. In spite of the
`GET` requests being identical between the two queries, the partitioned table
executes 22 `LIST` operations compared to the flat table's 1 `LIST` operation.
Curiously, the `GET` requests appear to be identical between the table creation
above and the first queries. It merits additional investigation, but the fact
the minimum size for the `GET` is `8`, I suspect this is fetching parquet
footers and subsequently the metadata. This may point to an opportunity to seed
the metadata cache earlier, which could improve performance for the first query
to the `ListingTable` in general.
### Subsequent Queries (cached metadata)
```sql
> \object_store_profiling trace
ObjectStore Profile mode set to Trace
> select count(*) from overture_flat;
+------------+
| count(*) |
+------------+
| 4198576839 |
+------------+
1 row(s) fetched.
Elapsed 0.791 seconds.
Object Store Profiling
2025-09-09T17:53:07.108517313+00:00 operation=List path=release/2025-07-23.0
List Summary:
count: 1
> select count(*) from overture_partitioned;
+------------+
| count(*) |
+------------+
| 4198576839 |
+------------+
1 row(s) fetched.
Elapsed 0.943 seconds.
Object Store Profiling
2025-09-09T17:53:17.991269778+00:00 operation=List duration=0.142341s
path=release/2025-07-23.0
2025-09-09T17:53:18.133630478+00:00 operation=List duration=0.042730s
path=release/2025-07-23.0/theme=addresses
2025-09-09T17:53:18.133986930+00:00 operation=List duration=0.084354s
path=release/2025-07-23.0/theme=base
2025-09-09T17:53:18.134506596+00:00 operation=List duration=0.109531s
path=release/2025-07-23.0/theme=divisions
2025-09-09T17:53:18.218349909+00:00 operation=List duration=0.042024s
path=release/2025-07-23.0/theme=base/type=bathymetry
2025-09-09T17:53:18.134958553+00:00 operation=List duration=0.128599s
path=release/2025-07-23.0/theme=transportation
2025-09-09T17:53:18.134252722+00:00 operation=List duration=0.131287s
path=release/2025-07-23.0/theme=buildings
2025-09-09T17:53:18.134744436+00:00 operation=List duration=0.135978s
path=release/2025-07-23.0/theme=places
2025-09-09T17:53:18.176435159+00:00 operation=List duration=0.097157s
path=release/2025-07-23.0/theme=addresses/type=address
2025-09-09T17:53:18.218771771+00:00 operation=List duration=0.075236s
path=release/2025-07-23.0/theme=base/type=land
2025-09-09T17:53:18.218580227+00:00 operation=List duration=0.086389s
path=release/2025-07-23.0/theme=base/type=infrastructure
2025-09-09T17:53:18.219139311+00:00 operation=List duration=0.087766s
path=release/2025-07-23.0/theme=base/type=land_use
2025-09-09T17:53:18.219346240+00:00 operation=List duration=0.089985s
path=release/2025-07-23.0/theme=base/type=water
2025-09-09T17:53:18.244078850+00:00 operation=List duration=0.070198s
path=release/2025-07-23.0/theme=divisions/type=division
2025-09-09T17:53:18.244290215+00:00 operation=List duration=0.072455s
path=release/2025-07-23.0/theme=divisions/type=division_area
2025-09-09T17:53:18.265775691+00:00 operation=List duration=0.071635s
path=release/2025-07-23.0/theme=buildings/type=building_part
2025-09-09T17:53:18.218957344+00:00 operation=List duration=0.124570s
path=release/2025-07-23.0/theme=base/type=land_cover
2025-09-09T17:53:18.263595467+00:00 operation=List duration=0.091649s
path=release/2025-07-23.0/theme=transportation/type=connector
2025-09-09T17:53:18.244510237+00:00 operation=List duration=0.115401s
path=release/2025-07-23.0/theme=divisions/type=division_boundary
2025-09-09T17:53:18.270813563+00:00 operation=List duration=0.093048s
path=release/2025-07-23.0/theme=places/type=place
2025-09-09T17:53:18.263816680+00:00 operation=List duration=0.120937s
path=release/2025-07-23.0/theme=transportation/type=segment
2025-09-09T17:53:18.265574453+00:00 operation=List duration=0.179032s
path=release/2025-07-23.0/theme=buildings/type=building
List Summary:
count: 22
duration min: 0.042024s
duration max: 0.179032s
duration avg: 0.099650s
```
These final queries have been switched over to `trace` for the profiling now
that metadata has been cached and the output won't be dominated by over 1500
`GET` requests. The difference in operation between the two tables is pretty
clear here. Curiously, the `LIST` operations for the partitioned table are
being emitted with a `duration`, which implies it's actually running
`list_with_delimiter()` which has allowed the instrumented object store to
collect accurate duration data!
Paying close attention to the timestamps here we can see the breadth first
search listing strategy used on partitioned tables in action. First the root
path is listed, blocking until the result is returned:
```
2025-09-09T17:53:17.991269778+00:00 operation=List duration=0.142341s
path=release/2025-07-23.0
```
$`17:53:17.991269778 + 0.142341s = 17:53:18.133610778`$
Then, the first level of the table is listed (this is the partition of
`theme`). This set of requests are executed in parallel, but block their
respective next level in the search. To keep things brief I'll just show the
math for `theme=addresses`.
```
2025-09-09T17:53:18.133630478+00:00 operation=List duration=0.042730s
path=release/2025-07-23.0/theme=addresses
2025-09-09T17:53:18.133986930+00:00 operation=List duration=0.084354s
path=release/2025-07-23.0/theme=base
2025-09-09T17:53:18.134506596+00:00 operation=List duration=0.109531s
path=release/2025-07-23.0/theme=divisions
2025-09-09T17:53:18.134958553+00:00 operation=List duration=0.128599s
path=release/2025-07-23.0/theme=transportation
2025-09-09T17:53:18.134252722+00:00 operation=List duration=0.131287s
path=release/2025-07-23.0/theme=buildings
2025-09-09T17:53:18.134744436+00:00 operation=List duration=0.135978s
path=release/2025-07-23.0/theme=places
```
$`17:53:18.133630478 + 0.042730s = 17:53:18.176360478`$
Finally the last set of directories get listed as their respective upstream
list requests complete
```
2025-09-09T17:53:18.176435159+00:00 operation=List duration=0.097157s
path=release/2025-07-23.0/theme=addresses/type=address
2025-09-09T17:53:18.218349909+00:00 operation=List duration=0.042024s
path=release/2025-07-23.0/theme=base/type=bathymetry
2025-09-09T17:53:18.218771771+00:00 operation=List duration=0.075236s
path=release/2025-07-23.0/theme=base/type=land
2025-09-09T17:53:18.218580227+00:00 operation=List duration=0.086389s
path=release/2025-07-23.0/theme=base/type=infrastructure
2025-09-09T17:53:18.219139311+00:00 operation=List duration=0.087766s
path=release/2025-07-23.0/theme=base/type=land_use
2025-09-09T17:53:18.219346240+00:00 operation=List duration=0.089985s
path=release/2025-07-23.0/theme=base/type=water
2025-09-09T17:53:18.218957344+00:00 operation=List duration=0.124570s
path=release/2025-07-23.0/theme=base/type=land_cover
... etc
```
Additionally, comparing the first `LIST` timestamp to the tail `LIST`
timestamp and how long it took to complete, we can see that we spent 0.453s
executing just listing operations of our total 0.943s query execution time.
$`(17:53:18.265574453 + 0.179032s) - 17:53:17.991269778 = 0.453336675`$
##
@alamb Given the above analysis, in the spirit of talking about this in
terms of IO reduction, normalizing the IO pattern between partitioned and flat
tables in this specific case would eliminate 21 `LIST` calls. Unfortunately,
it's difficult to talk about this in absolutes because the number of `LIST`
calls for partitioned tables depends on the table structure (and actually on
the query run as well, but I think this initial analysis is already long
enough). This data also suggests that if a `ListFilesCache` were to be
implemented and enabled it could eliminate all the IO for simple queries that
can be computed entirely from metadata.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]