I have a question about how data is distributed to the segments for the
various graph processing algorithms we are building.

Do we have guidance for users on how to distribute data?

Does the strategy vary by algorithm?

What impact will data distribution have on performance?

Section 4.1 of the Pregel paper
https://kowshik.github.io/JPregel/pregel_paper.pdf
describes a default partitioning scheme of hash(ID) mod N, where N is the
number of partitions, but then it says:

“Some applications work well with the default assignment, but some
benefit from defining custom assignment functions to better
exploit locality inherent in the graph. For example, a typical
heuristic employed for the Web graph is to colocate vertices
representing pages of the same site.”
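
To make the two schemes concrete, here is a minimal sketch in Python. The
function names, and the assumption that vertex IDs are URL strings, are just
for illustration and not from Pregel or any of our code:

    import zlib
    from urllib.parse import urlparse

    def default_partition(vertex_id: str, num_partitions: int) -> int:
        # Pregel's default: hash(ID) mod N, using a hash that is
        # stable across processes.
        return zlib.crc32(vertex_id.encode()) % num_partitions

    def site_colocating_partition(vertex_id: str, num_partitions: int) -> int:
        # Custom heuristic for a Web graph: hash only the host part of
        # the URL, so all pages of the same site land on one partition.
        host = urlparse(vertex_id).netloc
        return zlib.crc32(host.encode()) % num_partitions

    if __name__ == "__main__":
        pages = ["http://example.com/a", "http://example.com/b",
                 "http://other.org/x"]
        for p in pages:
            print(p, default_partition(p, 4), site_colocating_partition(p, 4))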

Frank
