Hi all, I've been playing around with Sedona a little bit, and first want to say thanks to all the work that has gone into this! I was able to get a spatial join working on several hundred billion points in under 30 minutes (time is relatively based on resources obviously). I did have to use the RDD API directly to get around the bug I made a PR <https://github.com/apache/incubator-sedona/pull/511> for. I wanted to share some initial thoughts/questions/ideas after looking through the code and playing around with it. While it would be cool to try to work on some of these features, I probably won't have time to dedicate to it, but wanted to bring them up at least in case I do or someone else wants to work them.
- The dynamic index join is the only join type that handles large datasets remotely well by creating an iterator. You only have up to the number of windows you have worth of rows in memory at one time. The precomputed index joins will keep all results in memory until you are done with the partition, while the nested loop join loads all points into memory before starting to join (which is what currently prevents SQL joins from working on large datasets at all). Possible options are: - Use the dynamic index approach of defining an iterator to handle things lazily. Might offer the best performance but adds a fair amount of complexity to do properly. - Use Java streams. It's a very simple flatMap/map operation, but I don't know the performance impact. - Use Scala iterators for the same flatMap/map operation which is what's used in a lot of the Spark code, not sure if it's more performant than Java streams? - I saw there was already talk about a broadcast join, that would be pretty cool. - The only way to configure join parameters from SQL is through the static Spark configuration. It would be cool to be able to change those dynamically to test out different options for performance without having to recreate your spark context. Options: - Use the SQL RuntimeConfig to build the SedonaConf before a join. - Use join hints to somehow override these configs. Absolutely no idea if this is possible or how complicated it would be but it sounds cool. - Package the sedona python package in the sedona-python-adapter jar. This is what Delta OSS does and makes it easy to use the Python modules by just including the jar as a package. Obviously the attrs and shapely dependencies make this trickier, so might not be as feasible. Curious what others think. -- Adam