Kontinuation commented on PR #744:
URL: https://github.com/apache/sedona/pull/744#issuecomment-1376763045

> @Kontinuation Quick follow-up question: since Sedona already offers [ST_GeoHash](https://sedona.apache.org/1.3.1-incubating/api/flink/Function/?h=geohash#st_geohash), I believe one can generate a GeoHash for each geometry, sort all geoms by GeoHash, then save the DataFrame as GeoParquet. This way, the pruning power of GeoParquet will be much stronger given the smaller bboxes in each Parquet partition?
   
Yes. The data has to be clustered by spatial proximity to maximize the efficiency of spatial predicate push-down. Sorting the data by GeoHash or a space-filling curve value is a good way to reorganize the dataset for query efficiency.
   
The following figures show the results of sorting geometries by high-precision GeoHash values, writing the data as GeoParquet files, and plotting the `bbox`es of the resulting files. Though the partitioning is not perfect, it still provides an opportunity to prune some files depending on the query window.
   
| AREALM (50 files) | AREALM (200 files) |
|-------------|-----------|
| ![arealm_50_files](https://user-images.githubusercontent.com/5501374/211470460-869a3581-4673-45ef-ae31-d3346c6c9271.png) | ![arealm_200_files](https://user-images.githubusercontent.com/5501374/211470626-aebf5fed-349b-43b9-a6e3-71b6d9e5960e.png) |
   
| MS Buildings NYC (50 files) | MS Buildings NYC (200 files) |
|-------------|-----------|
| ![ms_buildings_nyc](https://user-images.githubusercontent.com/5501374/211470420-64acbdb7-d398-40c6-a649-7bbafac53837.png) | ![ms_buildings_nyc_200](https://user-images.githubusercontent.com/5501374/211470721-c2176af4-f4b8-4de8-bf4d-77804a744954.png) |
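
For reference, each GeoParquet file carries the `bbox` of its geometry columns in the `geo` key of the Parquet footer metadata; a rough sketch of reading it (the file path is hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open one output file and print its GeoParquet metadata; the embedded JSON
// contains the per-file bbox of each geometry column, which is what the
// plots above visualize.
val file = new Path("/data/arealm_sorted/part-00000.parquet")   // hypothetical path
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))
try {
  val geo = reader.getFooter.getFileMetaData.getKeyValueMetaData.get("geo")
  println(geo) // e.g. {"columns": {"geometry": {"bbox": [xmin, ymin, xmax, ymax], ...}}, ...}
} finally {
  reader.close()
}
```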
   
The [Spatial Parquet paper](https://arxiv.org/pdf/2209.02158.pdf) also experimented with the effect of sorting the data using the Z-order curve and the Hilbert curve. Their approach works for file-level data pruning as well, though the paper primarily deals with row-group/page-level data skipping.
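
As a rough, self-contained illustration of Z-order sorting (not the paper's implementation; the lon/lat quantization below is an assumption), one can interleave the bits of quantized coordinates into a Morton code and sort by that value:

```scala
// Spread the lower 32 bits of v so that a zero bit sits between every pair of
// adjacent bits (the classic "part1by1" trick used for Morton codes).
def spreadBits(v: Long): Long = {
  var x = v & 0xFFFFFFFFL
  x = (x | (x << 16)) & 0x0000FFFF0000FFFFL
  x = (x | (x << 8))  & 0x00FF00FF00FF00FFL
  x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL
  x = (x | (x << 2))  & 0x3333333333333333L
  x = (x | (x << 1))  & 0x5555555555555555L
  x
}

// Quantize lon/lat into a 2^bits x 2^bits grid and interleave the cell
// coordinates; sorting rows by this value clusters nearby rows together.
def mortonCode(lon: Double, lat: Double, bits: Int = 16): Long = {
  val scale = (1L << bits) - 1
  val x = (((lon + 180.0) / 360.0) * scale).toLong
  val y = (((lat + 90.0) / 180.0) * scale).toLong
  (spreadBits(x) << 1) | spreadBits(y)
}
```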
   
Another approach is to partition the dataset by low-precision GeoHash values, which produces non-overlapping bounding boxes for point datasets. I've taken this approach when creating examples in the documentation. Carto's analytics toolbox for Databricks supports [partitioning the dataset by the Z2 index of tile coordinates](https://github.com/CartoDB/analytics-toolbox-core/blob/main/clouds/databricks/libraries/scala/core/src/main/scala/com/carto/analyticstoolbox/spark/spatial/OptimizeSpatial.scala#L49-L67), which is a similar approach.
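
A minimal sketch of that idea (precision and paths are hypothetical, reusing the `spark` session from the sketch above, and assuming the `geoparquet` writer supports directory partitioning like the built-in Parquet writer):

```scala
import org.apache.spark.sql.functions.expr

// Partition the output by a low-precision GeoHash: each GeoHash cell becomes
// its own directory of files, so point datasets get non-overlapping bboxes.
spark.read.format("geoparquet").load("/data/points")                // hypothetical input
  .withColumn("geohash", expr("ST_GeoHash(geometry, 2)"))           // coarse, 2-character cells
  .write
  .partitionBy("geohash")
  .format("geoparquet")
  .save("/data/points_partitioned")                                 // hypothetical output
```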
   
   

