Even Rouault wrote on 18.1.2026 at 16:50:
Ari,
I need to read features from a large Parquet file (10-20 GB, in S3)
using a set of user-defined constraints that I can parse into
non-spatial SQL and polygon masks. My tests so far show good
performance with a single non-spatial constraint and (separately)
with a bbox.
Do you mean you get bad performance when setting both
SetAttributeFilter() and SetSpatialFilter[Rect]()? I cannot explain
that. Combining them should not be less performant.
No, I'm just looking for the best way to mix spatial and non-spatial
filters/constraints when retrieving features from a Parquet file using GDAL.
You don't mention whether your GeoParquet files have a covering bounding
box column. For the default WKB encoding, this is essential to avoid
a full scan of the file.
I don't know about that - will check - but the basic
SetSpatialFilterRect on a GDAL Python layer works fine.
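For reference, a minimal sketch of one way to check for the covering
column. It assumes the file follows GeoParquet 1.1, where the "geo" key of
the Parquet file metadata holds JSON and a geometry column may declare a
"covering" entry with a "bbox" member. The helper name
bbox_covering_columns is hypothetical, and how the metadata is fetched
(pyarrow in the comment below) is an assumption, not something from this
thread.

```python
import json


def bbox_covering_columns(geo_metadata):
    """Return {geometry_column: bbox_spec} for columns declaring a covering bbox.

    `geo_metadata` is the value of the "geo" file metadata key, as str or
    bytes. Hypothetical helper; it only parses the JSON and does not read
    the Parquet file itself.
    """
    if isinstance(geo_metadata, bytes):
        geo_metadata = geo_metadata.decode("utf-8")
    geo = json.loads(geo_metadata)
    found = {}
    for name, column in geo.get("columns", {}).items():
        # "covering" is optional; tolerate a missing or null entry.
        bbox = (column.get("covering") or {}).get("bbox")
        if bbox:
            found[name] = bbox
    return found


# Assumed way to obtain the metadata with pyarrow for a local file:
#   import pyarrow.parquet as pq
#   raw = pq.read_metadata("features.parquet").metadata[b"geo"]
#   print(bbox_covering_columns(raw))
```

An empty result would suggest the file has no covering bbox column, in
which case spatial requests on WKB geometries may not be able to skip
much of the file.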
However, I'm not sure how to go forward with mixing non-spatial
constraints and perhaps multiple arbitrary polygons (which may be
non-adjacent).
If you have something like attr_filter && (Intersects(geom, poly1) ||
Intersects(geom, poly2)), then you should run two separate requests:
first attr_filter && Intersects(geom, poly1), and then
attr_filter && Intersects(geom, poly2).
Ok, so the attr_filter is not expensive even though it is applied twice.
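The per-polygon approach described above could be sketched roughly as
follows. This is only an illustration under some assumptions: the helper
name read_filtered_features is hypothetical, `layer` is expected to behave
like an osgeo.ogr.Layer and each polygon like an osgeo.ogr.Geometry, and
feature IDs are assumed to stay stable between requests so they can be
used to drop duplicates where polygons overlap.

```python
def read_filtered_features(layer, attr_filter, polygons):
    """Yield features matching attr_filter that intersect any of the polygons.

    Hypothetical helper, not part of GDAL. It relies only on
    SetAttributeFilter(), SetSpatialFilter(), layer iteration, GetFID(),
    GetGeometryRef() and Intersects() of the objects passed in.
    """
    seen_fids = set()
    # The attribute filter is set once and stays active for every request.
    layer.SetAttributeFilter(attr_filter)
    try:
        for poly in polygons:
            # One spatial request per polygon, so the driver gets a chance
            # to prune by extent each time instead of scanning everything.
            layer.SetSpatialFilter(poly)
            for feature in layer:
                fid = feature.GetFID()
                if fid in seen_fids:
                    continue  # already returned for an earlier polygon
                geom = feature.GetGeometryRef()
                # Explicit exact test, kept in case the spatial filter
                # only compared bounding boxes (an assumption).
                if geom is not None and geom.Intersects(poly):
                    seen_fids.add(fid)
                    yield feature
    finally:
        # Leave the layer unfiltered for later use.
        layer.SetSpatialFilter(None)
        layer.SetAttributeFilter(None)
```

With a real layer this would be called as, for example,
read_filtered_features(ds.GetLayer(0), "some_attr = 'x'", [poly1, poly2]),
where the attribute expression and the polygons are placeholders.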
GDAL SQL docs tell me that with Spatialite built-in I could use
ST_Intersects but does that help with Parquet files?
No, because that wouldn't translate into a SetSpatialFilter[Rect]()
request, and thus you would get a full scan of the file.
Ok, I assumed that too.
How about constructing the non-spatial SQL query first, running it on
the dataset, and then using SetSpatialFilterRect on the resulting layer
object possibly multiple times plus ogr.Geometry.Intersects on each
feature coming from the obtained layer? My intuition would tell me to
first do the spatial filtering as that (may) narrow down the search
considerably. But then I cannot use the non-spatial SQL as that
requires a dataset to be executed on.
You could store the result of the spatial request in a temporary
dataset (possibly in memory) and then apply the attribute filter. But
as said above, I'm a bit surprised that combining the attribute filter
and a (single geometry) spatial filter isn't efficient.
Maybe I was not clear: at this point I'm only wondering how best to
combine the attribute filter and the spatial filter.
Instead of the Parquet driver, you may also try with duckdb and the
ADBC driver. The duckdb SQL engine generally outperforms
libarrow/libparquet.
Hm, Parquet files are a given at this point - I'm doing
consultancy/development for a client and Parquet is their choice, so I
guess I have a developer role now. :)
Even
Thanks,
Ari
_______________________________________________
gdal-dev mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/gdal-dev