nuno-faria commented on issue #15582:
URL: https://github.com/apache/datafusion/issues/15582#issuecomment-3128489298
@alamb is this still something that would benefit upstream DataFusion?
I've implemented caching of Parquet metadata after noticing that a large
share of the time spent on our workloads comes from (re)reading it, as shown by
`metadata_load_time`. This time becomes noticeable in large files that have a
large number of pages, where most of the metadata reading time comes from
retrieving the page indexes.
Here are some results:
- Simple reads
```
+----------------------------------------------------+
|      1k simple queries (select where k = ...)      |
+-----------+------------+---------+-----------------+
| # of Rows | Not cached | Cached  | Speedup         |
+-----------+------------+---------+-----------------+
| 100k      | 1.9447s    | 1.1773s | 1.6517x faster  |
+-----------+------------+---------+-----------------+
| 100M      | 21.2953s   | 1.6018s | 13.2943x faster |
+-----------+------------+---------+-----------------+
```
- Partial explain of a `DataSourceExec`:
```
# not cached
metrics=[
    bytes_scanned=938381
    metadata_load_time=12.697101ms
    time_elapsed_opening=13.0083ms
    time_elapsed_processing=13.1382ms
    ...
]
# cached
metrics=[
    bytes_scanned=22488
    metadata_load_time=35.801µs
    time_elapsed_opening=243.4µs
    time_elapsed_processing=817.2µs
    ...
]
```
- TPC-H
- As for TPC-H, the improvements won't be as noticeable given the queries
are more complex, but they will be there. Here are the `metadata_load_time`
values of a TPC-H query (`sf=10`):
```
# not cached
[nation] metadata_load_time=338.501µs
[orders] metadata_load_time=95.710812ms
[supplier] metadata_load_time=1.478501ms
[lineitem] metadata_load_time=424.488412ms
[lineitem] metadata_load_time=383.579612ms
[lineitem] metadata_load_time=461.110012ms
# cached
[nation] metadata_load_time=76.001µs
[orders] metadata_load_time=15.823212ms
[supplier] metadata_load_time=49.201µs
[lineitem] metadata_load_time=906.312µs
[lineitem] metadata_load_time=6.372212ms
[lineitem] metadata_load_time=1.617712ms
```
As for the implementation, it relies on the session's `CacheManager` and a
custom `CachedParquetFileReaderFactory`. Metadata caching can be enabled
independently for each Parquet file through `ParquetReadOptions`, and a cached
entry is invalidated when the underlying file is modified. I would be happy to
open a PR, but I still have some questions:
- For the `CacheManager` to remain generic, I created a `pub type
FileMetadata = dyn Any + Send + Sync;` to represent metadata, which essentially
can end up storing anything. Unlike the other information stored (`Statistics`
and `ObjectMeta`), there isn't a common type for embedded metadata. Is this the
right approach, or should the `CacheManager` be aware of `ParquetMetaData`?
- Unlike the other `CacheManager` parameters, which I believe are
exclusively user-provided, I think it would make sense for the metadata cache
to be populated with a `DefaultFilesMetadataCache`, so it's easier to enable
caching just with `ParquetReadOptions` or `set ...`. Does this make sense?
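
To make the first question more concrete, here is a minimal, standard-library-only sketch of the type-erased approach. It is not the actual implementation: `FilesMetadataCacheSketch`, `FileVersion`, `FakeParquetMetaData`, and `as_parquet` are hypothetical names used for illustration, and `FileVersion` only stands in for the size and modification time that a real `ObjectMeta` would provide. It assumes a cached entry is valid only while the file's size and last-modified time are unchanged.

```rust
use std::any::Any;
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::time::SystemTime;

/// Type-erased metadata, mirroring the `FileMetadata` alias proposed above.
pub type FileMetadata = dyn Any + Send + Sync;

/// Hypothetical stand-in for the parts of `ObjectMeta` used for validation.
#[derive(Clone, Debug, PartialEq)]
pub struct FileVersion {
    pub size: u64,
    pub last_modified: SystemTime,
}

/// Hypothetical cache: path -> (file version seen at insert time, metadata).
#[derive(Default)]
pub struct FilesMetadataCacheSketch {
    entries: RwLock<HashMap<String, (FileVersion, Arc<FileMetadata>)>>,
}

impl FilesMetadataCacheSketch {
    /// Returns the cached metadata only if the stored version matches the
    /// file's current version; a stale entry is treated as a miss.
    pub fn get(&self, path: &str, current: &FileVersion) -> Option<Arc<FileMetadata>> {
        let entries = self.entries.read().unwrap();
        match entries.get(path) {
            Some((version, meta)) if version == current => Some(Arc::clone(meta)),
            _ => None,
        }
    }

    /// Inserts or overwrites the entry for `path`.
    pub fn put(&self, path: &str, version: FileVersion, meta: Arc<FileMetadata>) {
        self.entries
            .write()
            .unwrap()
            .insert(path.to_string(), (version, meta));
    }
}

/// Hypothetical stand-in for `ParquetMetaData`.
#[derive(Debug, PartialEq)]
pub struct FakeParquetMetaData {
    pub num_row_groups: usize,
}

/// The caller has to downcast to recover the concrete type, which is the
/// cost of keeping the cache generic.
pub fn as_parquet(meta: Arc<FileMetadata>) -> Option<Arc<FakeParquetMetaData>> {
    meta.downcast::<FakeParquetMetaData>().ok()
}
```

In this sketch the cache itself knows nothing about Parquet; only the reader that calls `as_parquet` does. Making the cache aware of `ParquetMetaData` would remove the downcast but tie the cache to one file format.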
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]