ericsun95 commented on PR #37053:
URL: https://github.com/apache/spark/pull/37053#issuecomment-1179057359

   > Thanks for the PR. Can you help justify that this is better than using 
EdgePartition2D for those use cases? As to the change, I'm personally not in 
favor of deprecating EdgePartition1D just to change its name.
   
   Hey, sure. I experimented with a topological-sort-like algorithm on a 
big graph (billions of vertices and edges). The new `EdgePartition1DDst` gave 
a 36% performance improvement over `EdgePartition2D`, and it was more 
stable across different hardware configurations (executor instance types, 
number of executors, etc.). The other partition strategies could not even 
finish the computation.
   
   In theory, assume we need to collect messages sent from child nodes to 
parent nodes iteratively. Then it is necessary to make sure most of the data 
is co-located (and thus shuffled less). `EdgePartition2D` only gives 
about a `1/3` chance that edges with the same dst are located in the same 
partition, while `EdgePartition1DDst` can guarantee that most edges with the 
same dst id are located together. The difference is more obvious if the graph 
has a tree structure.
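   
   To make the locality argument concrete, here is a minimal standalone 
sketch (not the code in this PR, and with no Spark dependency) of partitioning 
by dst only. The object name, the `partitionByDst` helper, and the mixing 
constant are illustrative assumptions; the real strategy would extend 
GraphX's `PartitionStrategy` and implement `getPartition`.

```scala
object DstPartitionSketch {
  // Local aliases mirroring the GraphX type names; defined here so the
  // sketch compiles without Spark on the classpath.
  type VertexId = Long
  type PartitionID = Int

  // Illustrative mixing constant (an assumption for this sketch).
  private val mixingPrime: VertexId = 1125899906842597L

  // Hypothetical dst-keyed 1D partitioner: src is deliberately ignored, so
  // every edge pointing at the same dst gets the same partition id.
  // floorMod keeps the result in [0, numParts) even if the product overflows.
  def partitionByDst(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
    Math.floorMod(dst * mixingPrime, numParts.toLong).toInt

  def main(args: Array[String]): Unit = {
    val numParts = 8
    // Three children sending to parent 100, one child sending to parent 200.
    val edges = Seq((1L, 100L), (2L, 100L), (3L, 100L), (4L, 200L))
    edges.foreach { case (src, dst) =>
      println(s"edge $src -> $dst goes to partition ${partitionByDst(src, dst, numParts)}")
    }
  }
}
```

   All edges into vertex 100 get the same partition id regardless of src, 
which is the property that matters when messages are aggregated at the parent.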
   
   As for the naming, I do think it's better to split `EdgePartition1D` into 
`EdgePartition1DSrc` and `EdgePartition1DDst` so the names are consistent, 
rather than having `EdgePartition1D` and `EdgePartition1DDst`, which is 
confusing to people without the background. I also found that usage of 
`EdgePartition1D` in the repo is minimal, so it wouldn't be a breaking change. 
However, I am fine with removing the deprecation if there are concerns about 
this change.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

