jwass commented on issue #1268: URL: https://github.com/apache/sedona/issues/1268#issuecomment-1981054310
> @jwass Is there a reason why you want to use the Sedona rdd-based spatial partitioning? This is considered as low-level API and only used for spatial join. > > Most importantly, given polygon data, the spatial partitioned RDD will have duplicates because some polygons will cross the boundaries of multiple partitions and we duplicate those to overlapping partitions. Our spatial join algorithm will automatically de-dup after getting the join result. @jiayuasu What I really want to do is write out a large geoparquet dataset where the individual parquet files are spatially partitioned intelligently. This will improve performance of remote spatial queries by bounding box. We have some solutions now to split by geohash/quadkey, but a partitioning scheme backed by a kdb-tree / r-tree / etc would be better. The fact that polygons' extents will cause overlaps of the spatial partitions is fine but we do need to assign each row to only one partition. I was hoping there was a way to use `df.repartition` with the spatial rdd's partitioner to make it all work. But let me know if this is not the right use for this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@sedona.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org