jwass opened a new issue, #1268:
URL: https://github.com/apache/sedona/issues/1268
Is there a way to spatially partition a DataFrame, presumably by first
converting it to a SpatialRDD and back, and then write it out using that
partitioning scheme? Below is my guess at how to accomplish this, but I'm not
sure whether I'm misunderstanding something. I'm also relatively new to
working with Spark and Sedona.
## Expected behavior
Load a DataFrame, convert it to a SpatialRDD, spatially partition it, convert
it back to a DataFrame, and save the result. I'd expect the final DataFrame to
preserve the partitioning from the RDD.
## Actual behavior
Adapter.toDf() does not appear to preserve the spatial partitioning, or I'm
doing something else wrong.
## Steps to reproduce the problem
```python
from sedona.core.enums import GridType
from sedona.utils.adapter import Adapter

df = sedona.read.format("geoparquet").load(path)
rdd = Adapter.toSpatialRdd(df, "geometry")
rdd.analyze()  # compute the dataset envelope needed for partitioning

# KDB-tree spatial partitioning into (approximately) 6 partitions
rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=6)

df2 = Adapter.toDf(rdd, spark)
df2.write.format("geoparquet").save(output_path)
```
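If the end goal is spatially clustered output files rather than the RDD partitioning itself, one DataFrame-side alternative I've been considering is to derive a spatial sort key and range-repartition on it before writing. This is only a rough sketch, assuming the `df` and `output_path` above, a Spark session with Sedona's SQL functions registered, and lon/lat geometry as `ST_GeoHash` expects; the precision 8 and partition count 6 are arbitrary choices:

```
from pyspark.sql.functions import expr

# Derive a geohash key per geometry, range-partition on it, and sort
# within partitions so nearby geometries land in the same output files.
df_keyed = df.withColumn("gh", expr("ST_GeoHash(geometry, 8)"))
(df_keyed.repartitionByRange(6, "gh")
    .sortWithinPartitions("gh")
    .drop("gh")
    .write.format("geoparquet").save(output_path))
```

Each output file then covers a roughly contiguous geohash range, which may be what I'm actually after here.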
However, the RDD approach above doesn't seem to work: the number of
partitions written out for df2 was far greater than 6.
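For intuition on why a spatial sort key clusters nearby geometries into the same partitions, here is a small, self-contained plain-Python sketch of a Z-order (Morton) key, the same locality idea that geohashing uses. This is only an illustration, not Sedona code, and assumes coordinates normalized to [0, 1):

```python
def morton_key(x: float, y: float, bits: int = 16) -> int:
    """Interleave the bits of the quantized x and y coordinates.

    Points close together in 2-D space tend to get nearby keys, so
    sorting or range-partitioning by the key clusters them.
    """
    xi = int(x * (1 << bits))
    yi = int(y * (1 << bits))
    key = 0
    for i in range(bits):
        key |= ((xi >> i) & 1) << (2 * i)       # even bit positions: x
        key |= ((yi >> i) & 1) << (2 * i + 1)   # odd bit positions: y
    return key

# Two nearby points get closer keys than two distant ones.
a = morton_key(0.10, 0.10)
b = morton_key(0.11, 0.11)
c = morton_key(0.90, 0.90)
assert abs(a - b) < abs(a - c)
```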
## Settings
Sedona version = 1.5.1
Apache Spark version = ?
API type = Python
Python version = ?
Environment = Databricks