Re: [I] Preserve Spatial Partitioning From RDD to Dataframe [sedona]

via GitHub Wed, 06 Mar 2024 06:59:53 -0800


jwass commented on issue #1268:
URL: https://github.com/apache/sedona/issues/1268#issuecomment-1981054310


   > @jwass Is there a reason why you want to use the Sedona RDD-based spatial 
partitioning? This is considered a low-level API and is only used for spatial 
joins.
   > 
   > Most importantly, given polygon data, the spatially partitioned RDD will 
have duplicates because some polygons cross the boundaries of multiple 
partitions and we duplicate those into every partition they overlap. Our spatial 
join algorithm automatically de-duplicates after getting the join result.
   
   @jiayuasu What I really want to do is write out a large GeoParquet dataset 
whose individual Parquet files are spatially partitioned intelligently. 
This will improve the performance of remote spatial queries by bounding box. We 
have some solutions now that split by geohash/quadkey, but a partitioning scheme 
backed by a KDB-tree, R-tree, etc. would be better. It is fine that polygon 
extents will make the spatial partitions overlap, but we do need to assign each 
row to exactly one partition. I was hoping there was a way to use 
`df.repartition` with the spatial RDD's partitioner to make it all work. But 
let me know if this is not the right use for it.
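
   As a rough illustration of the "one partition per row" requirement, here is 
a minimal pure-Python sketch. It is not Sedona's partitioner or API: the 
helpers `centroid` and `assign_partitions` are hypothetical names, and the 
scheme (a KD-tree-style median split on bounding-box centroids) is an assumed 
stand-in for a KDB-tree or R-tree. Each row lands in exactly one leaf, chosen 
by its centroid, even when its extent overlaps a neighbouring leaf.

```python
from statistics import median


def centroid(bbox):
    """Center of an (xmin, ymin, xmax, ymax) bounding box."""
    xmin, ymin, xmax, ymax = bbox
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)


def assign_partitions(bboxes, max_per_leaf=2):
    """Return one partition id per bbox (hypothetical helper, not Sedona).

    Splits recursively on the median centroid coordinate, alternating
    between the x and y axes. A row is assigned by its centroid only, so
    it never appears in more than one partition.
    """
    ids = [None] * len(bboxes)
    next_id = [0]  # mutable counter shared with the nested function

    def make_leaf(indices):
        for i in indices:
            ids[i] = next_id[0]
        next_id[0] += 1

    def split(indices, depth):
        if len(indices) <= max_per_leaf:
            make_leaf(indices)
            return
        axis = depth % 2  # 0 = x, 1 = y
        cut = median(centroid(bboxes[i])[axis] for i in indices)
        left = [i for i in indices if centroid(bboxes[i])[axis] < cut]
        right = [i for i in indices if centroid(bboxes[i])[axis] >= cut]
        if not left or not right:
            # All centroids fall on one side of the cut on this axis:
            # stop splitting here rather than recurse forever.
            make_leaf(indices)
            return
        split(left, depth + 1)
        split(right, depth + 1)

    split(list(range(len(bboxes))), 0)
    return ids


# The second box overlaps the first, yet still receives a single id.
bboxes = [
    (0, 0, 1, 1),
    (0.5, 0.5, 3, 3),
    (10, 10, 11, 11),
    (10, 0, 12, 1),
    (0, 10, 1, 12),
]
partition_ids = assign_partitions(bboxes, max_per_leaf=2)
```

   In Spark, ids like these could presumably be attached as a column and used 
with `df.repartition(n, "partition_id")` or `df.write.partitionBy("partition_id")` 
before writing. That wiring is an assumption here and is not something this 
thread confirms Sedona supports directly.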
   
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@sedona.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
