ericsun95 commented on PR #37053: URL: https://github.com/apache/spark/pull/37053#issuecomment-1179057359
> Thanks for the PR. Can you help justify that this is better than using EdgePartition2D for those use cases? As to the change, I'm personally not in favor of deprecating EdgePartition1D just to change its name.

Hey, sure. I have experimented with a topological-sort-like algorithm on a big graph (billions of vertices and edges). The new `EdgePartition1DDst` gave a 36% performance improvement over `EdgePartition2D`, and it was more stable across different hardware configurations (executor instance types, number of executors, etc.). The other partition strategies could not even finish the computation.

In theory, assume we need to iteratively collect messages sent from child nodes to parent nodes. To keep shuffle low, most edges that share a destination should sit in the same partition. With `EdgePartition2D` there is only about a `1/3` chance that edges with the same dst are located in the same partition, while `EdgePartition1DDst` can guarantee that most edges with the same dst id are located together. The difference is more obvious if the graph has a tree structure.

As for the naming, I do think it's better to split `EdgePartition1D` into `EdgePartition1DSrc` and `EdgePartition1DDst` so that the names are consistent. Having `EdgePartition1D` alongside `EdgePartition1DDst` is confusing to people without the background. I also found that the usage of `EdgePartition1D` in the repo is minimal, so it wouldn't be a breaking change. However, I am fine with removing the deprecation if there are concerns about this change.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org