paleolimbot opened a new pull request, #646:
URL: https://github.com/apache/sedona-db/pull/646
In #251 we tried to use the file metadata cache and found that it actually
slowed down queries. Hiroaki kindly benchmarked the effect of the cache against
DuckDB to demonstrate that the file cache there is effective for queries
against large tables. @b4l kindly showed how to do this in #604.
This PR pipes through the requisite options to ensure the cache is used for
GeoParquet reads. This is especially important because we need to pull two
extra copies of the metadata after DataFusion has already pulled it: if we
don't use the cached version, we issue three requests where we could have
issued one.
A secondary issue is that the default size of the cache is not well-equipped
to deal with Overture buildings, which we were using to benchmark this. The
buildings data requires almost 900 megabytes of cache space, and because it is a
least-recently-used (LRU) cache being queried roughly in order three times, a
cache even slightly smaller than what the dataset requires is 0% useful: each
entry is evicted before it is read again. The slowdown we saw with the cache
enabled is probably due to contention on the mutex guarding the in-memory cache.
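To illustrate why an undersized LRU cache gives no benefit under this access pattern, here is a toy simulation. It is only a sketch: `count_misses` is a hypothetical helper, entries are treated as equally sized, and it models a textbook LRU cache rather than DataFusion's actual implementation.

```python
from collections import OrderedDict


def count_misses(n_files: int, capacity: int, passes: int = 3) -> int:
    """Count cache misses when n_files keys are read in order `passes` times."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for _ in range(passes):
        for key in range(n_files):
            if key in cache:
                cache.move_to_end(key)  # hit: mark as most recently used
                continue
            misses += 1  # miss: stands in for a remote metadata request
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return misses


print(count_misses(n_files=100, capacity=100))  # 100: one request per file
print(count_misses(n_files=100, capacity=99))   # 300: every access misses
```

With room for every entry, three in-order passes cost one miss per file. With room for one fewer, each entry is evicted just before it is needed again, so all three passes miss.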
```python
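import os

# Read the public Overture bucket without signing requests
os.environ["AWS_SKIP_SIGNATURE"] = "true"
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"

import sedona.db

sd = sedona.db.connect()
# Make the metadata cache large enough for the buildings metadata (~900 MB)
sd.sql("SET datafusion.runtime.metadata_cache_limit = '900M'").execute()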
# 16s on main, 10s on this PR with a big enough cache
sd.read_parquet(
"s3://overturemaps-us-west-2/release/2026-02-18.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)
# Second time: 16s on main, 0s with this PR
sd.read_parquet(
"s3://overturemaps-us-west-2/release/2026-02-18.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)
```
I took the opportunity to redo the Overture buildings documentation page to
include this and a few other improvements we added in the last few months.
Closes #250.